Methods and apparatus for localized processing within multicore neural networks

ABSTRACT

Methods and apparatus for localized processing within multicore neural networks. Unlike existing solutions that rely on commodity software and hardware to perform “brute force” large scale neural network processing, the various techniques described herein map and partition a neural network based on the hardware limitations of a target platform. Specifically, the various implementations described herein synergistically leverage localization, sparsity, and distributed scheduling, to enable neural network processing within embedded hardware applications. As described herein, hardware-aware mapping/partitioning enhances neural network performance by e.g., avoiding pin-limited memory accesses, processing data in compressed formats/skipping unnecessary operations, and decoupling scheduling between cores.

PRIORITY

This application claims the benefit of priority to U.S. Provisional Patent Application Ser. No. 63/050,090 filed Jul. 9, 2020 and entitled “METHODS AND APPARATUS FOR LOCALIZED PROCESSING WITHIN MULTICORE NEURAL NETWORKS”, which is incorporated herein by reference in its entirety.

RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. ______, filed and entitled “METHODS AND APPARATUS FOR MATRIX AND VECTOR STORAGE AND OPERATIONS”, and U.S. patent application Ser. No. ______, filed and entitled “METHODS AND APPARATUS FOR THREAD-BASED SCHEDULING IN MULTICORE NEURAL NETWORKS”, each of which is incorporated herein by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with Government support under Agreement No. N00014-19-9-0003, awarded by ONR. The Government has certain rights in the invention.

COPYRIGHT

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

TECHNICAL FIELD

This disclosure relates generally to the field of neural networking. More particularly, the present disclosure is directed to hardware, software, and/or firmware implementations of neural network processing.

DESCRIPTION OF RELATED TECHNOLOGY

Incipient research is directed to so-called “neural network” computing. Unlike traditional computer architectures, neural network processing emulates a network of connected nodes (aka neurons) that loosely model the neuro-biological functionality found in the human brain. While neural network computing is still in its infancy, such technologies already have great promise for e.g., compute rich, low power, and/or continuous processing applications.

Existing neural networks are most commonly emulated within general-purpose programming environments because commodity hardware and software compilers are well understood and readily available. Unfortunately, such implementations suffer from many inefficiencies due to e.g., hardware limitations (e.g., physical connectivity), compiler design, and/or instruction scheduling. Neural networks would be a great fit for parallel processing and distributed computing models; however, corresponding changes to hardware and compilers are needed.

SUMMARY

The present disclosure addresses the foregoing needs by disclosing, inter alia, methods, devices, systems, and computer programs for neural network processing within multicore network processors.

In one aspect, methods and apparatus for operating a multicore neural network architecture are disclosed. One exemplary apparatus embodiment includes: a plurality of cores; and one or more memories, where the one or more memories are configured to store a first set of global parameters and a second set of local parameters. In one exemplary embodiment, each core comprises logic configured to: obtain a first sparse vector; perform global neural network processing based on the first sparse vector and the first set of global parameters; perform local neural network processing based on the second set of local parameters and a dense vector that is specific to each core; and sparsify the dense vector to generate a second sparse vector for broadcast to the plurality of cores.

In one such embodiment, a non-transitory computer readable apparatus comprising a storage medium having one or more computer programs stored thereon is disclosed. In one exemplary embodiment, the non-transitory computer readable apparatus includes one or more computer programs that, when executed by a processing apparatus, are configured to: obtain a first sparse vector; perform global neural network processing based on the first sparse vector and a first set of global parameters; perform local neural network processing based on a second set of local parameters and a dense vector that is specific to each core; and sparsify the dense vector to generate a second sparse vector for broadcast to a plurality of cores.

In one aspect, methods and apparatus for operating a core of a multicore neural network architecture are disclosed. One exemplary method embodiment includes: receiving an input vector, a global activation vector, a local activation vector, a portion of a global weight matrix, and a local matrix, by the core of the multicore architecture; calculating an updated local activation vector based on the input vector, the global activation vector, the local activation vector, the portion of the global weight matrix, and the local matrix; calculating an updated localized portion of an updated global activation vector based on the updated local activation vector; and broadcasting the updated localized portion of an updated global activation vector to other cores of the multicore architecture.

Other features and advantages of the present disclosure will immediately be recognized by persons of ordinary skill in the art with reference to the attached drawings and detailed description of exemplary embodiments as given below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a graphical representation of a multicore processor architecture, commonly used within the processing arts.

FIG. 2A is a graphical representation of one exemplary multicore architecture, in accordance with the various principles described herein.

FIG. 2B is a graphical representation of the extensible nature of the multicore architecture, in accordance with the various principles described herein.

FIG. 3 is a logical block diagram illustrating the data traffic flow throughout the multicore architecture, in accordance with the principles described herein.

FIG. 4 is a graphical representation of an existing gated recurrent unit (GRU) neural network process that is commonly used within the related arts.

FIG. 5 is a graphical representation of one exemplary modified GRU neural network process that maps locality-based processing based on hardware considerations, in accordance with the various principles described herein.

FIG. 6 is a graphical representation of an exemplary neural network's parameters to be partitioned into a multicore architecture, useful to illustrate aspects of the present disclosure.

FIG. 7 is a graphical representation of re-grouped and stacked neural network parameters, useful to illustrate various aspects of the present disclosure.

FIG. 8 is a graphical representation of partitioning neural network parameters based on identified data dependencies, useful to illustrate aspects of the present disclosure.

FIG. 9 is a graphical representation of neural network parameter distribution within the target multicore architecture, useful to illustrate aspects of the present disclosure.

FIG. 10 is a graphical representation of a partitioned memory footprint that fits within the target multicore architecture, useful to illustrate aspects of the present disclosure.

FIG. 11 is a logical flow diagram of an exemplary method for operating a multicore neural network architecture, in accordance with the various principles described herein.

FIG. 12 is a logical flow diagram of an exemplary method for computing a modified gated recurrent unit (GRU) within a core of a multicore neural network architecture, in accordance with the various principles described herein.

FIGS. 13 and 14 are segments of pseudocode for operating a single core of a multicore neural network, in accordance with the various principles described herein.

FIG. 15 is a logical flow diagram of an exemplary method for optimizing machine models from standard machine learning frameworks, in accordance with the various principles described herein.

FIG. 16 is a graphical representation of an exemplary hierarchy of layers of a machine learning model that have been tagged with heterogeneous precision, useful to illustrate aspects of the present disclosure.

FIG. 17 is a logical flow diagram of a method for partitioning and placing code, useful in conjunction with various embodiments described herein.

FIG. 18 is a logical flow diagram of a method for optimizing machine models to operate within a multicore architecture, useful in conjunction with various embodiments described herein.

FIG. 19 is a logical flow diagram of a method for generating assembly code, useful in conjunction with various embodiments described herein.

DETAILED DESCRIPTION

In the following detailed description, reference is made to the accompanying drawings which form a part hereof wherein like numerals designate like parts throughout, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of embodiments is defined by the appended claims and their equivalents.

Aspects of the disclosure are disclosed in the accompanying description. Alternate embodiments of the present disclosure and their equivalents may be devised without departing from the spirit or scope of the present disclosure. It should be noted that any discussion herein regarding “one embodiment”, “an embodiment”, “an exemplary embodiment”, and the like indicate that the embodiment described may include a particular feature, structure, or characteristic, and that such particular feature, structure, or characteristic may not necessarily be included in every embodiment. In addition, references to the foregoing do not necessarily comprise a reference to the same embodiment. Finally, irrespective of whether it is explicitly described, one of ordinary skill in the art would readily appreciate that each of the particular features, structures, or characteristics of the given embodiments may be utilized in connection or combination with those of any other embodiment discussed herein.

Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order than the described embodiment. Various additional operations may be performed and/or described operations may be omitted in additional embodiments.

The Complexity of Software-Based Neural Networks

FIG. 1 is a graphical representation of a multicore processor architecture 100, commonly used within the processing arts. The multicore processor 102 may include one or more cores 112A, 112B . . . 112N. Each core may include logic (e.g., arithmetic logic units (ALUs), registers, etc.) arranged to perform various control and data path operations. Examples of control and data path operations may include without limitation: instruction fetch/instruction decode (IF/ID), operation execution and addressing, memory accesses, and/or data write back. A small amount of frequently used instructions and data may be locally cached “on-chip” for fast access; otherwise, “off-chip” storage provides cost-effective storage of bulk data (104A, 104B . . . 104N).

During operation, the processor cores 112A, 112B . . . 112N read and write computer instructions and/or data from the external memories 104A, 104B . . . 104N via a shared bus interface 106. Each computer instruction (also referred to as an “opcode”) identifies the operation to be sequentially performed based on one or more operands (data, register locations, and/or memory addresses). By linking together sequences of computer instructions, it is possible to compute any computable sequence.

In “general-purpose” computing, the processor cores and memories may be tasked with any arbitrary task. A shared bus architecture and monolithic memory map flexibly allows every core 112A, 112B . . . 112N to access any memory location within the external memories 104A, 104B . . . 104N. As a practical matter, however, the shared bus interface 106 is physically pin-limited; there is a fixed width data bus that services all processor-memory connections one-at-a-time. Limited connectivity can significantly affect performance where multiple cores try to access the memories at the same time. Additionally, local cache sizes are limited; reading and writing to large data structures may require multiple “off-chip” transactions across the pin-limited bus. Finally, “global” data structures cannot be accessed by more than one core at a time (simultaneous access could result in data hazards and race conditions).

Unlike general-purpose computing, so-called “neural network” computing uses biologically-inspired algorithms that take their inspiration from the human brain. Neural networks are characterized by a multi-layered composition of high-dimensional linear and non-linear functions. The intermediate function outputs between layers are known as activations. Neural networks typically contain a large number of parameters that are used for e.g., vector-matrix operations. The parameters are tuned in a gradient descent training process based on known input/output data pairings. After training, the parameters are held constant during deployment as the neural network processes novel input data to execute its trained task. For example, FIG. 1 graphically depicts one exemplary neural network computation that is performed as a vector-matrix multiplication 150. As shown therein, neural activations are modeled as a vector of digital values (a) that are multiplied by a matrix of parameter weights (B) for the neural network; the output (c) corresponds to the output neural activations.

Unfortunately, naïvely allocating neural network processing to the multicore processor architecture 100 is extremely inefficient. Firstly, each of the cores 112A, 112B, . . . 112N must access the complete set of neural network data structures. The vector and matrix dimensions are a function of the number of nodes (neurons) within the neural network; thus, neural networks of any significant size exceed data sizes that can be efficiently cached on-chip. As a result, all of the cores 112A, 112B, . . . 112N constantly move data across the pin-limited bus interface 106. Additionally, each of the cores 112A, 112B, . . . 112N reads and writes to the same data structures (a, B, c), so the cores often block one another.

As a related issue, “Big O” notation is used in the computer arts to classify algorithms according to computational complexity (run time and space requirements, O, as a function of input size, N). Big O notation is widely used to describe the limiting behavior of a function as its input grows, e.g., processing complexity, memory storage, bandwidth utilization, etc. For example, multiplying a vector of size N by an N×N matrix has a computational complexity of O(N²) because each of the N output elements requires N multiply-accumulate operations (one per vector element). Doubling the vector size (N) quadruples the computational complexity (O(N²)).
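By way of illustration only, the quadratic growth described above may be observed by counting multiply-accumulate operations in a brute force vector-matrix multiplication. The following is a minimal sketch in Python; the function names (vector_matrix_multiply, mac_count) are hypothetical and do not appear in the figures.

```python
def vector_matrix_multiply(a, B):
    """Brute force c = a·B; also returns the number of multiply-accumulates."""
    n_rows, n_cols = len(B), len(B[0])
    assert len(a) == n_rows, "vector length must match the matrix row count"
    c = [0.0] * n_cols
    macs = 0
    for j in range(n_cols):          # one output element per column
        for i in range(n_rows):      # one multiply-accumulate per vector element
            c[j] += a[i] * B[i][j]
            macs += 1
    return c, macs


def mac_count(n):
    """Count multiply-accumulates for a size-n vector and an n-by-n matrix."""
    a = [1.0] * n
    B = [[1.0] * n for _ in range(n)]
    _, macs = vector_matrix_multiply(a, B)
    return macs
```

In this sketch, mac_count(8) returns four times the value of mac_count(4), consistent with O(N²) scaling.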

Referring back to FIG. 1, existing neural networking solutions rely on general-purpose vector-matrix operations. Such solutions often rely on hardware accelerators to perform “brute-force” element-by-element calculation. However, the data structures that are used in neural network processing can be made to be quite sparse (a high ratio of null values). Brute force vector-matrix operations can be particularly inefficient for sparse data structures because the vast majority of memory reads, vector-matrix multiplications, and memory write-backs are unnecessary (null valued). Furthermore, as neural networks continue to grow in size and complexity, inefficient brute force solutions will quadratically increase in complexity.
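As a simplified illustration of the inefficiency noted above, the following Python sketch compares the operation count of a brute force multiplication against one that skips null operands. The function names are hypothetical. Note that this sketch still reads each element in order to test it for null; avoiding those reads requires compressed storage formats.

```python
def dense_macs(a, B):
    """Multiply-accumulates performed by a brute force c = a·B."""
    return len(a) * len(B[0])


def skip_null_multiply(a, B):
    """Compute c = a·B, performing work only where both operands are non-null."""
    n_cols = len(B[0])
    c = [0.0] * n_cols
    macs = 0
    for i, a_i in enumerate(a):
        if a_i == 0:
            continue                 # a null activation contributes nothing
        for j in range(n_cols):
            b = B[i][j]
            if b == 0:
                continue             # a null weight contributes nothing
            c[j] += a_i * b
            macs += 1
    return c, macs


# Illustrative (made-up) data: one non-null activation out of four.
a = [0, 2, 0, 0]
B = [[1, 1, 1],
     [0, 3, 0],
     [5, 0, 0],
     [0, 0, 7]]
c, useful_macs = skip_null_multiply(a, B)
```

For the made-up data above, the brute force approach performs twelve multiply-accumulates whereas only one contributes to the result.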

Substantial factors in neural network energy consumption may include moving large amounts of data, and storing a large number of parameters in leaky SRAM (static random access memory). Charging and discharging wires to transfer data takes energy. Wire energy costs scale with wire length (e.g., chip area) and are a significant concern for chip design. As a related issue, neural networks are parameter-rich, but on-chip SRAM memory is costly to implement. On-chip SRAM is optimized for performance, not power consumption, so SRAM cells may consume significant amounts of energy even when idle, due to leakage. The combination of these factors can limit neural network adoption; in one specific example, remote applications are often power constrained.

Exemplary Multicore Architecture

The aforementioned complexities of neural network processing have presented significant issues for embedded device implementations. Notably, existing neural network implementations are handled within software, without regard to the underlying hardware platform limitations; unfortunately, physical connectivity (e.g., pin limitations), computational complexity, and/or scheduling overhead present significant obstacles for embedded devices. More directly, improved solutions for handling neural networks in embedded environments are needed; ideally, such solutions should enable compute rich, low power, and/or continuous processing applications.

To these ends, various principles described herein synergistically leverage locality, sparsity, and distributed scheduling, to enable neural network processing within embedded hardware applications. Unlike existing solutions that rely on commodity software and hardware to perform “brute force” large scale neural network processing, the various techniques described herein map and partition a neural network based on the hardware limitations of a target platform. The exemplary hardware-aware mapping/partitioning described herein enhances neural network performance by e.g., avoiding pin-limited memory accesses, processing data in compressed formats/skipping unnecessary operations, and distributing task scheduling while decoupling timing requirements between cores.

In a first aspect, hardware-aware mapping and partitioning may be used to minimize data transfers and parameter storage requirements. In one embodiment, neural networking parameters and activations may be sparsified to fit the neural network within chip constraints without performance degradation; in other embodiments, the neural network may be sparsified and trained for acceptable levels of performance degradation. As described in greater detail herein, fewer parameters and activations may result in fewer memory lookups and data transfers. By avoiding costly off-chip data transfers and minimizing chip area, the chip can minimize both static power and dynamic energy costs.

In a second aspect, hardware-aware mapping and partitioning may be used to localize parameters to where they are used. Various embodiments of the present disclosure enable a “compute-near memory” approach where computations are distributed across multiple cores to strategically co-locate parameters with their associated logic. Co-locating data and processing reduces data transfers across the chip and only requires a small set of parameters for each core to locally process and store.

Furthermore, the exemplary hardware-aware mapping and partitioning techniques may exploit a high degree of parallelism to complete tasks quickly and maximize the time spent in low-power sleep states to mitigate leakage. In one embodiment, the multicore architecture comprises a number of variable-length Single-Instruction-Multiple-Data (SIMD) cores that can perform the same operations on multiple data elements via parallel data paths (e.g., a matrix-vector multiply or a pointwise non-linearity on a vector). During operation, the data paths may each operate in parallel so multiple instructions can execute simultaneously in a core. Likewise, each core of the multicore processor may operate in parallel, communicating with other cores only when necessary.

Other optimizations described herein may manage thread scheduling during compile-time and program-time, rather than run-time. In other words, rather than using a centralized scheduler that is evaluated at “run-time”, the neural network is compiled at “compile-time” into threads; threads and their thread dependency count are distributed to each core at program-time. The core can run the thread at run-time without any scheduling conflict. Certain implementations may also leverage instruction-level support for sparse vector-matrix operations.
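The role of a per-thread dependency count may be illustrated with the following Python sketch. It is a hypothetical software model only: the class names, the signal/run interface, and the ready queue are assumptions made for illustration and are not taken from the present disclosure or the related applications. In this model, a thread becomes runnable once its dependency count reaches zero, so the core needs no centralized scheduler.

```python
from collections import deque


class Thread:
    """A unit of work plus the number of inputs it still waits for (assumed model)."""

    def __init__(self, name, dependency_count, work):
        self.name = name
        self.remaining = dependency_count
        self.work = work


class Core:
    """Runs its own threads as their dependencies are satisfied."""

    def __init__(self, threads):
        self.threads = {t.name: t for t in threads}
        # Threads with no outstanding dependencies are ready immediately.
        self.ready = deque(t for t in threads if t.remaining == 0)
        self.log = []

    def signal(self, name):
        """Record that one dependency of the named thread has been satisfied."""
        thread = self.threads[name]
        thread.remaining -= 1
        if thread.remaining == 0:
            self.ready.append(thread)

    def run(self):
        """Execute every thread that is currently ready."""
        while self.ready:
            thread = self.ready.popleft()
            thread.work()
            self.log.append(thread.name)


core = Core([
    Thread("local_update", 0, lambda: None),   # depends on nothing
    Thread("global_update", 2, lambda: None),  # waits for two remote inputs
])
core.run()                      # only local_update can run
core.signal("global_update")    # first remote input arrives
core.run()                      # still waiting on the second input
core.signal("global_update")    # second remote input arrives
core.run()                      # global_update now runs
```

In this model the dependency counts are fixed before execution begins, which mirrors the distinction drawn above between program-time distribution and run-time execution.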

FIG. 2A is a graphical representation of one exemplary multicore architecture 200, in accordance with the various principles described herein. As shown, the architecture 200 does not use an external memory to store the neural network data structures nor any intermediate results. Instead, each core includes its own processing hardware (212A, 212B, 212C, 212D), local weights (214A, 214B, 214C, 214D), global weights (216A, 216B, 216C, 216D), working memory (218A, 218B, 218C, 218D), and accumulator (220A, 220B, 220C, 220D). While the following discussion is presented in the context of a core with its own dedicated memories, the techniques described herein may be used in shared memory systems and/or hybrids thereof. More generally, dedicated core resources may enable improved core performance whereas shared resources across cores may provide flexibility and/or cross-core communication opportunities.

Unlike existing neural network processors which naïvely distribute processing load (discussed above), the exemplary multicore architecture decouples processing among the cores. In one aspect of the present disclosure, neural network processing is mathematically transformed (mapped) and spatially partitioned into dense “neighborhood” processing and sparse “global” communications processing (see e.g., Techniques for Targeting a Neural Network to the Multicore Architecture). As described in greater detail hereinafter, the mapping/partitioning may be based on the physical processing hardware-memory connectivity; in other words, processing hardware and memory transactions may be mapped/partitioned so that they are not pin-limited. The mapping/partitioning preserves the properties of the original global neural network at a fraction of the memory accesses.

As shown in FIG. 2A, the local neighborhood weights are stored in the local weight memories (214A, 214B, 214C, 214D) and each core's subset (or “slice”) of the global network weights is stored in the global weight memories (216A, 216B, 216C, 216D). During operation, applicable weights are retrieved from the corresponding memory for computation; intermediate results may be stored within a working memory (218A, 218B, 218C, 218D) and/or accumulator (220A, 220B, 220C, 220D).

While the illustrated embodiment is shown in the context of four (4) cores emulating a global neural network of nodes, the multicore architecture described herein may be broadly extended to any number of cores and/or any number of nodes (see e.g., FIG. 2B). Additionally, the foregoing discussion presented a symmetric distribution; however, asymmetric distributions may be substituted with equal success. Partitioning may be scaled to an individual core's capabilities and/or application requirements. For example, asymmetric systems may enable high performance cores (more logic, memory, and/or faster clock rates) and low power cores (less logic, less memory, and/or power efficient clocking). In such implementations, matrix operations may be sized to complete within operational constraints, given a core's capabilities. Furthermore, any consolidation, division, distribution, agglomeration, and/or combination of processing hardware and/or memory may be substituted by artisans of ordinary skill in the related arts, given the contents of the present disclosure.

FIG. 3 is a logical block diagram illustrating the data traffic flow 300 throughout the multicore architecture, in accordance with the various principles described herein. Each neighborhood (302A, 302B, 302C, 302D) is characterized by a locally dense neural network. Neighborhoods are connected via a global interconnect matrix (304A, 304B, 304C, 304D) to the other neighborhoods; the output of the neighborhoods can be further sparsified prior to global distribution via interconnect logic (306A, 306B, 306C, 306D).
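The data flow of FIG. 3 may be approximated, for a single neighborhood, by the following Python sketch. It is a simplified model under stated assumptions: the hyperbolic tangent non-linearity, the magnitude threshold used for sparsification, and all function and parameter names are hypothetical choices made for illustration rather than features of the disclosed architecture. Indices in the outgoing sparse vector are local to the core; translating them to global indices is omitted.

```python
import math


def sparsify(dense, threshold):
    """Keep only (index, value) pairs whose magnitude exceeds the threshold."""
    return [(i, v) for i, v in enumerate(dense) if abs(v) > threshold]


def core_step(local_state, local_weights, global_slice, sparse_global, threshold):
    """One update of a neighborhood: dense local work plus sparse global work.

    local_state   : dense vector specific to this core
    local_weights : dense square matrix for the neighborhood
    global_slice  : this core's rows of the global weight matrix
    sparse_global : list of (global_index, value) pairs received from other cores
    """
    n = len(local_state)
    acc = [0.0] * n
    # Locally dense processing: every local weight is used.
    for r in range(n):
        for c in range(n):
            acc[r] += local_weights[r][c] * local_state[c]
    # Globally sparse processing: only the received non-null entries are touched.
    for idx, val in sparse_global:
        for r in range(n):
            acc[r] += global_slice[r][idx] * val
    new_state = [math.tanh(x) for x in acc]       # assumed non-linearity
    return new_state, sparsify(new_state, threshold)


new_state, outgoing = core_step(
    local_state=[1.0, 0.0],
    local_weights=[[0.5, 0.0], [0.0, 0.5]],
    global_slice=[[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    sparse_global=[(2, 1.0)],
    threshold=0.5,
)
```

Here only one of the core's two outputs exceeds the threshold, so only one (index, value) pair would be distributed to the other neighborhoods.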

Various aspects described herein synergistically leverage globally sparse, locally dense connectivity to attain a variety of benefits heretofore unrealized. For instance, existing neural network techniques naïvely store, and brute force process, every matrix element in memory (whether connected or not). Naïve (hardware agnostic) storage requires O(N²) memory for an N×N matrix, which is considerably more than necessary for a sparse matrix with very few non-null elements. Similarly, brute force calculation quadratically increases in complexity as a function of network size (regardless of the matrix's sparsity). In contrast, one exemplary embodiment compresses sparse neural network data structures based on actual, non-null, connectivity (rather than all possible connections). This greatly reduces storage requirements as well as computational complexity. In one such variant, the compression and reduction in complexity are sized to fit within the memory footprint and processing capabilities of a core.

As a further optimization, there are overhead costs associated with compression, and different techniques have different costs and benefits. Since vectors and matrices are used differently in neural network processing, these data structures may be represented differently to further enhance performance. For example, as discussed in U.S. patent application Ser. No. ______, filed ______ and entitled “METHODS AND APPARATUS FOR MATRIX AND VECTOR STORAGE AND OPERATIONS”, previously incorporated herein by reference in its entirety, exemplary embodiments compress sparse neural network data structures based on actual, non-null, connectivity (rather than all possible connections). This greatly reduces storage requirements as well as computational complexity. In some variants, the compression and reduction in complexity are sized to fit within the memory footprint and processing capabilities of a core. The exemplary compression schemes represent sparse matrices with links to compressed column data structures, where each compressed column data structure only stores non-null entries to optimize column-based lookups of non-null entries. Similarly, sparse vector addressing skips nulled entries to optimize for vector-specific non-null multiply-accumulate operations.
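The general idea of column-compressed matrices and null-skipping vectors may be sketched as follows. This Python example is illustrative only; the list-of-pairs layout and the function names are assumptions and are not the storage format of the incorporated application. The sketch computes y = W·x by visiting only the columns selected by non-null vector entries, and within each such column only the stored non-null weights.

```python
def compress_columns(W):
    """Per column, store only the non-null entries as (row, value) pairs."""
    n_rows, n_cols = len(W), len(W[0])
    return [[(i, W[i][j]) for i in range(n_rows) if W[i][j] != 0]
            for j in range(n_cols)]


def compress_vector(x):
    """Store only the non-null entries of a vector as (index, value) pairs."""
    return [(j, v) for j, v in enumerate(x) if v != 0]


def sparse_matvec(columns, sparse_x, n_rows):
    """Compute y = W·x from the compressed forms; also count multiply-accumulates."""
    y = [0.0] * n_rows
    macs = 0
    for j, x_j in sparse_x:            # skip null vector entries entirely
        for i, w in columns[j]:        # walk only the non-null weights of column j
            y[i] += w * x_j
            macs += 1
    return y, macs


# Illustrative (made-up) data.
W = [[0, 2, 0],
     [1, 0, 0],
     [0, 0, 3]]
x = [0, 4, 5]
columns = compress_columns(W)
y, macs = sparse_matvec(columns, compress_vector(x), n_rows=len(W))
stored_entries = sum(len(col) for col in columns)
```

For the made-up data above, three weights are stored instead of nine, and two multiply-accumulates are performed instead of nine.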

Additionally, existing neural network processing relies on a centralized task scheduler that consumes significant processing and transactional overhead to coordinate between cores. In contrast, the sparse global communications between cores of the exemplary multicore architecture decouple neighborhood processing and enable the multicore architecture to asynchronously operate the cores in parallel. Consequently, optimized variants may distribute task coordination between cores and implement asynchronous handshaking protocols between cores. For example, as discussed in U.S. patent application Ser. No. ______, filed ______ and entitled “METHODS AND APPARATUS FOR THREAD-BASED SCHEDULING IN MULTICORE NEURAL NETWORKS”, previously incorporated herein by reference in its entirety, thread-level parallelism and asynchronous handshaking are leveraged to decouple core-to-core dependencies. The principles described therein enable threads to run independently of one another, without any centralized scheduling and/or resource locking (e.g., semaphore signaling, critical path execution, etc.). Decoupling thread dependencies allows cores to execute threads asynchronously. In one such implementation, the multicore architecture includes a set of distributed cores that run in parallel. The cores communicate with each other via an interconnecting network of router nodes. Each core processes its threads asynchronously with respect to the other cores. Most threads correspond to the dense neighborhood, and the core can process these threads independently of the other cores. Global communication is sparse (infrequent) and is handled via an asynchronous handshake protocol.

Techniques for Targeting a Neural Network to the Multicore Architecture

In a first aspect of the present disclosure, a global neural network is mapped into a set of sparsely interconnected, dense neighborhood neural networks that are partitioned based on hardware platform constraints. In one exemplary embodiment, the transformation may be performed on a modified gated recurrent unit (GRU). Alternative implementations may perform the transformation on modified Long Short-Term Memory (LSTM) or any other “remember-forget” recurrent neural network (RNN) logic. More generally, any logic or component that retains/removes information between nodes of the neural network may be modified to transform a first domain (first vector space) to a second domain (second vector space).

As used herein, the terms “transform”, “transformation”, “map”, “mapping”, and/or other linguistic derivations thereof refer to mathematical or algorithmic functions that relate a function from a first domain (vector space) to a second domain (vector space). Transforms may be linear or non-linear; linearity is present where the mathematical functions of addition and scaling are preserved (i.e., the result of multiple operators considered together is the same as the sum of the operations considered individually). Similarly, transformations may be lossless (reversible) or lossy (irreversible); for example, lossy transformations may reduce unnecessary precision, decimate values to add sparsity, etc. More generally, while illustrative examples of linear matrix transformations are described below, any algorithmic transformation may be substituted with equal success by artisans of ordinary skill in the related arts.
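The distinctions drawn above may be illustrated with a short Python sketch. The function names and the thresholding rule are hypothetical. The first check confirms, for sample inputs, that a matrix transformation preserves addition and scaling; the second function shows a lossy transformation that decimates small values to add sparsity, after which distinct inputs can no longer be told apart.

```python
def apply_matrix(M, v):
    """Apply the linear transformation represented by matrix M to vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def preserves_linearity(M, u, v, alpha, tol=1e-9):
    """Check M(alpha*u + v) == alpha*M(u) + M(v) for the given sample inputs."""
    lhs = apply_matrix(M, [alpha * a + b for a, b in zip(u, v)])
    rhs = [alpha * a + b
           for a, b in zip(apply_matrix(M, u), apply_matrix(M, v))]
    return all(abs(l - r) <= tol for l, r in zip(lhs, rhs))


def decimate(v, threshold):
    """Lossy transformation: zero out entries whose magnitude is below threshold."""
    return [x if abs(x) >= threshold else 0.0 for x in v]


M = [[1, 2], [3, 4]]
linear_ok = preserves_linearity(M, u=[1, 0], v=[0, 1], alpha=2)

# Two different inputs collapse to the same output, so the original
# values cannot be recovered (the transformation is irreversible).
first = decimate([0.05, -0.8, 0.2], threshold=0.1)
second = decimate([0.01, -0.8, 0.2], threshold=0.1)
```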

As used herein, the terms “partition”, “partitioning”, “place”, “placing”, and/or other linguistic derivations thereof refer to the allocation and assignment of hardware to perform algorithms or logic. For example, the dataflow may be partitioned into a multicore architecture by assigning specific functions (neighborhoods) to a specific core. Partitioning may be implemented within software (e.g., non-transitory computer readable instructions executed by processing logic), within hardware (e.g., logic gates and/or sequential logic), or some combination thereof (e.g., firmware, etc.).

As a brief aside, so-called “backpropagation” refers to neural network processing techniques that use error information in supervised learning. Recurrent neural networks (RNNs) are an example of one type of neural network processing that benefits from backpropagation techniques. During operation, data propagates “forward” through the nodes of the network (the RNN is a temporal directed graph); error information (gradient information) is propagated “backward” and used to improve the network's weighting.

The temporal nature of recurrent neural networks (RNNs) allows the RNN to exhibit dynamic behavior over time. As a practical matter, temporally recent gradient information has a greater influence on behavior; however, over time, the gradient diminishes in importance (also referred to as a “vanishing gradient.”) Various techniques are used in RNNs to optimize the amount of information that is “remembered” or “forgotten” in the network.

Gated recurrent units (GRUs) are commonly used in recurrent neural networks (RNNs) to retain/remove gradient information. FIG. 4 is a logical representation of an existing GRU process 400 that is commonly used within the related arts. During operation, the input vector (x) is modified based on previous activation vectors (h). The exemplary GRU process 400 uses a hyperbolic tangent (tanh) function to positively or negatively reinforce network state information (reinforcement may range from +1 to −1), and a sigmoid to “remember” (+1) or “forget” (0) network state information.

As shown therein, the input vector at time t (x_(t)) is multiplied by a first set of input weights (W_(ir)) and the previous activation vector (h_(t−1)) is multiplied by a first set of network weights (W_(hr)). The first result is summed and scaled according to a first sigmoid non-linearity (σ). This step is described by the equation:

r _(t)=σ(W _(ir) x _(t) +W _(hr) h _(t−1))   EQN 1:

Additionally, the input vector and previous activation vector are also multiplied by a second set of input weights (W_(iz)) and a second set of network weights (W_(hz)), respectively. The second result is summed and scaled according to a second sigmoid non-linearity (σ). This step is described by the equation:

z _(t)=σ(W _(iz) x _(t) +W _(hz) h _(t−1))   EQN 2:

The input vector is multiplied by a third set of input weights (W_(in)) and the result of EQN 1 (r_(t)) is multiplied by the previous activation vector state and a third set of network weights (W_(hn)). The result is summed and scaled according to a hyperbolic tangent non-linearity (tanh). This step is described by equation 3 (or 3′):

n _(t)=tanh(W _(in) x _(t) +r _(t) *W _(hn) h _(t−1))   EQN 3:

n _(t)=tanh(W _(in) x _(t) +W _(hn)(r _(t) *h _(t−1)))   EQN 3′:

The results of the foregoing processes are further mixed via element-wise mixers. Each element-wise mixer takes a pair of inputs (x₀, x₁) and mixes them according to the select (s) to generate an output (y), as described by the following equation:

y(i)=s(i)*x ₀(i)+(1−s(i))*x ₁(i)   EQN 4:

The resulting activation vector (h_(t)) of the GRU process 400 is given by the following equation:

h _(t)=(1−z _(t))*n _(t) +z _(t) * h _(t−1)   EQN 5:
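EQNS. 1-5 can be sketched in a few lines of plain Python. The following is a minimal illustration only: the helper names (`matvec`, `mix`, `gru_step`) and the list-of-lists matrix layout are assumptions made for this sketch and do not appear in FIG. 4; EQN 3 (rather than EQN 3′) is used for the candidate computation.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def mul(u, v):
    """Element-wise product."""
    return [a * b for a, b in zip(u, v)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def mix(s, x0, x1):
    """Element-wise mixer of EQN 4: y(i) = s(i)*x0(i) + (1 - s(i))*x1(i)."""
    return [si * a + (1.0 - si) * b for si, a, b in zip(s, x0, x1)]

def gru_step(x, h_prev, W_ir, W_hr, W_iz, W_hz, W_in, W_hn):
    """One timestep of the GRU process described by EQNS. 1-5."""
    r = sigmoid(add(matvec(W_ir, x), matvec(W_hr, h_prev)))       # EQN 1
    z = sigmoid(add(matvec(W_iz, x), matvec(W_hz, h_prev)))       # EQN 2
    n = tanh(add(matvec(W_in, x), mul(r, matvec(W_hn, h_prev))))  # EQN 3
    # EQN 5 is the mixer of EQN 4 with select z, x0 = h_prev, x1 = n.
    return mix(z, h_prev, n)
```

Every weight matrix is read in full at every call, which is the access pattern that the locality-based dataflow described below is intended to avoid.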

Notably, all of the GRU's parameters are in global neural network matrices and the parameters are accessed at every timestep. In other words, the GRU process quadratically scales as a function of the neural network's size (O(N²)). Furthermore, the aforementioned GRU process does not account for hardware platform limitations. Thus, existing GRU implementations are poorly suited for embedded devices that are limited to small memory footprints, reduced processing capabilities, and/or limited power.

Referring now to FIG. 5, one exemplary dataflow 500 that enables locality-based processing within a specific hardware platform is shown. The exemplary dataflow 500 may be used to map an “original” global neural network (hardware platform agnostic) into a functionally identical set of sparsely interconnected, dense neighborhood neural networks. The exemplary dataflow 500 includes: a modified gated recurrent unit (GRU), block diagonal matrices D 502 that correspond to each densely connected neighborhood, and global matrices W 504 that correspond to sparse global connectivity. The dense, local matrices D 502 take a neighborhood activation vector (h_(t−1)) as input, whereas the sparse, global matrices W 504 take both a sparsified input vector (x′_(t)) and a sparsified global activation vector (h′_(t−1)). In the illustrated embodiment, a rectified linear unit (ReLU) 506 sparsifies the neighborhood activation vector (h_(t−1)) to produce the next sparse global activation vectors (h′_(t)).

Conceptually, the aforementioned transformation (map) divides the neural network processing into portions that are amenable for distribution among multiple cores. However, the modified GRU still “touches” different portions of the neural network. Thus, the exemplary embodiment further partitions the mapped neural network to ensure that each core has local access to its slice of neural network parameter weights.

As a preliminary step, FIG. 6 depicts a graphical representation of an exemplary neural network's unpartitioned (naïvely mapped) weight matrices. For illustrative purposes, the global matrices for tanh (positive/negative reinforcement) and sigmoid (remember/forget) functions (W_(hr), W_(hz), W_(hn), W_(ir), W_(iz), W_(in)) emulate a neural network of sixty-four (64) nodes (an assumed sparsity of ˜10% for activation vectors (x, h) and parameters (W) is shown). Also included are block diagonal matrices (D_(hr), D_(hz), D_(hn)) that correspond to the naïve mapping to the target multicore architecture of eight (8) cores. In the naïve mapping, each core handles 1/8th of the processing burden.

The matrix operations of FIG. 6 can be re-grouped and described as a series of stacked global operations. FIG. 7 illustrates the re-grouped and stacked global operations, mathematically described as follows:

$\begin{matrix} {W = \begin{bmatrix} W_{ir} & W_{hr} \\ W_{iz} & W_{hz} \\ W_{in} & W_{hn} \end{bmatrix}} & {{EQN}\mspace{14mu} 6} \\ {D = \begin{bmatrix} D_{hr} \\ D_{hz} \\ D_{hn} \end{bmatrix}} & {{EQN}\mspace{14mu} 7} \end{matrix}$

Once re-grouped and stacked, the naïve mapping can be mathematically simplified to the following global equations:

i′=[x′ _(t) h′_(t−1)]  EQN 8:

[a b c]=Wi′   EQN 9:

[d e f]=Dh _(t−1)   EQN 10:

r _(t)=σ(a+d)   EQN 11:

z _(t)=σ(b+e)   EQN 12:

n _(t)=tanh(c+r _(t) *f)   EQN 13:

h _(t)=(1−z _(t))*n _(t) +z _(t) *h _(t−1)   EQN 14:

h′ _(t)=ReLU(h _(t))   EQN 15:

Mathematically, EQNS. 8-15 may also be restated as follows:

r _(t)=σ(W _(ir) x′ _(t) +W _(hr) h′ _(t−1) +D _(hr) h _(t−1))   EQN 16:

z _(t)=σ(W _(iz) x′ _(t) +W _(hz) h′ _(t−1) +D _(hz) h _(t−1))   EQN 17:

n _(t)=tanh(W _(in) x′ _(t) +W _(hn) h′ _(t−1) +r _(t)*(D _(hn) h _(t−1)))   EQN 18:

h _(t)=(1−z _(t))*n _(t) +z _(t) *h _(t−1)   EQN 19:

h′ _(t)=ReLU(h _(t))   EQN 20:
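The stacked form of EQNS. 8-15 may be easier to follow as code. The sketch below is illustrative only and assumes that W is stored as a single stacked matrix of 3N rows (EQN 6) and D as a single stacked matrix of 3N rows (EQN 7), each held as a list of rows; the function and variable names are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def split3(v):
    """Split a stacked result into its three equal-length gate portions."""
    k = len(v) // 3
    return v[:k], v[k:2 * k], v[2 * k:]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def modified_gru_step(x_sparse, h_sparse_prev, h_prev, W, D):
    """One timestep of the modified GRU; returns (h_t, h'_t)."""
    i = list(x_sparse) + list(h_sparse_prev)             # EQN 8: concatenate
    a, b, c = split3(matvec(W, i))                       # EQN 9: sparse global
    d, e, f = split3(matvec(D, h_prev))                  # EQN 10: dense local
    r = [sigmoid(p + q) for p, q in zip(a, d)]           # EQN 11
    z = [sigmoid(p + q) for p, q in zip(b, e)]           # EQN 12
    n = [math.tanh(p + rt * q) for p, rt, q in zip(c, r, f)]           # EQN 13
    h = [(1.0 - zt) * nt + zt * hp for zt, nt, hp in zip(z, n, h_prev)]  # EQN 14
    h_sparse = [max(0.0, v) for v in h]                  # EQN 15: ReLU
    return h, h_sparse
```

The dense vector h is kept for the next local update, while only the ReLU output h′ would be shared with other cores.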

FIG. 7 shows that any neural network that is naïvely mapped to a target multicore architecture may be fully expanded to identify the data dependencies of a neural network, e.g., W_(hr) 702 is always multiplied by the global activation vector h′_(t−1). Additionally, each block (or submatrix) of the block diagonal D matrix only touches a small portion of the global activations; for example, one processor core's blocks of the block diagonal matrices (D_(hr), D_(hz), D_(hn)) only touch their corresponding portions of the global activations resulting from the global weight operations (W_(hr), W_(hz), W_(hn), W_(ir), W_(iz), W_(in)), identified in bands 704A, 704B, 704C.

Referring now to FIG. 8, the neural network parameters may be partitioned based on the identified data dependencies. The neural network parameters have been partitioned and re-grouped such that data dependencies for each neighborhood are lumped together. Specifically, each core's blocks of the block diagonal matrices (D_(hr), D_(hz), D_(hn)) are grouped together, and each core's corresponding portions of the global neural network parameters (W_(hr), W_(hz), W_(hn), W_(ir), W_(iz), W_(in)) are grouped together. FIG. 9 provides a graphical illustration of how the neural network parameters of FIG. 8 may be distributed within the target multicore architecture. Each cluster core locally stores its blocks of the block diagonal matrix; however, the global neural network parameters share common activation vectors (h′_(t)) and input vectors (x′_(t)) that can be stored in aggregate.

In contrast to FIG. 9, FIG. 10 illustrates a partitioned memory footprint that fits within the local memory for each processor core. Notably, the sparse activation vectors (h′_(t)) and input vectors (x′_(t)) are the only remaining core-to-core data dependency for the target multicore architecture. As a result, all the core-to-core data dependencies are satisfied by the communication of sparse data alone: by broadcasting the input vectors (x′_(t)) to the cores, and by each core broadcasting its portion of the sparse activation vector (h′_(t)). As shown therein, each local memory contains both the core's blocks of the block diagonal matrices (D_(hr), D_(hz), D_(hn)) and its portions of the global neural network parameters (W_(ir), W_(iz), W_(in) and W_(hr), W_(hz), W_(hn)), which take their respective inputs (h′_(t), x′_(t)). In the illustrated embodiment, the multicore processor does not need an external memory and wholly avoids the aforementioned pin-limitations of fixed width data busses for external memories. The foregoing techniques illustrate one exemplary mapping/partitioning based on on-chip connectivity of the exemplary multicore architecture; however, virtually any mapping/partitioning technique may be substituted with equal success by artisans of ordinary skill in the related arts.
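The partitioning can be checked numerically for a single gate: each core needs only its own rows of the global matrix, its own block of the block diagonal matrix, and its own slice of the dense activation vector, while the sparse vector is shared. The sketch below is a toy illustration under assumed values (two cores, two nodes per core, a one-element input); none of the names or numbers come from FIGS. 8-10.

```python
def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def block_diag(blocks):
    """Expand per-core square blocks into one block diagonal matrix."""
    n = sum(len(b) for b in blocks)
    out = [[0.0] * n for _ in range(n)]
    off = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, val in enumerate(row):
                out[off + r][off + c] = val
        off += len(b)
    return out

def global_preactivation(W, D_blocks, i_sparse, h):
    """Unpartitioned gate pre-activation: W i' + D h."""
    D = block_diag(D_blocks)
    return [p + q for p, q in zip(matvec(W, i_sparse), matvec(D, h))]

def core_preactivation(core, npc, W, D_blocks, i_sparse, h):
    """The same quantity computed from one core's local slices only."""
    rows = slice(core * npc, (core + 1) * npc)
    W_local = W[rows]          # this core's rows of the global weights
    D_local = D_blocks[core]   # this core's block of the block diagonal
    h_local = h[rows]          # this core's dense neighborhood activations
    return [p + q for p, q in zip(matvec(W_local, i_sparse),
                                  matvec(D_local, h_local))]

# Assumed toy values: N = 4 nodes, 2 cores, i' = [x'] + ReLU(h).
W = [[1, 0, 0, 2, 0], [0, 0, 3, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 4]]
D_blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
h = [1, -1, 2, -2]
i_sparse = [3] + [max(0, v) for v in h]

reference = global_preactivation(W, D_blocks, i_sparse, h)
merged = [v for core in range(2)
          for v in core_preactivation(core, 2, W, D_blocks, i_sparse, h)]
```

In this sketch `merged` and `reference` agree element for element, and the only data a core reads that it does not own is the shared sparse vector `i_sparse`.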

Notably, the memory footprint and processing complexity for each neighborhood is a fraction of the equivalent global neural network. Consequently, the mapping/partitioning principles described above may be further extended to segment the global neural network to accommodate the capabilities of any hardware platform. More generally, while the illustrated embodiment is shown in the context of four (4) cores emulating a global neural network of 128 nodes, the multicore architecture described herein may be broadly extended to any number of cores and/or any number of nodes. Additionally, the foregoing discussion presented a symmetric distribution; however, asymmetric distributions may be substituted with equal success. Partitioning may be scaled to each individual core's capabilities and/or application requirements. For example, asymmetric systems may enable high performance cores (more logic, memory, and/or faster clock rates) and low power cores (less logic, less memory, and/or power efficient clocking). In such implementations, matrix operations may be sized to complete within operational constraints, given a core's capabilities. Furthermore, any consolidation, division, distribution, agglomeration, and/or combination of processing hardware and/or memory may be substituted by artisans of ordinary skill in the related arts, given the contents of the present disclosure.

Methods

Referring to method 1100 of FIG. 11, a logical flow diagram of an exemplary method 1100 for operating a multicore neural network architecture is shown. In one embodiment, the multicore neural network architecture includes locally dense neural networks that are connected via sparse global interconnects.

In one embodiment, a set of global parameters and a set of local parameters are stored in memories associated with each core. In one exemplary embodiment, the global parameters define the global interconnection and the local parameters define local neural network processing (e.g., global weights W_(ir), W_(iz), W_(in) and W_(hr), W_(hz), W_(hn), and local block-diagonal matrices D_(hr), D_(hz), D_(hn)). In some variants, the global parameters correspond to interconnections between nodes of the target multicore architecture based on spatial organization within a neighborhood. In some variants, the local parameters are mapped to neighborhoods of the target multicore architecture based on similarity of data dependency. While the exemplary embodiments are presented in the context of hardware-aware global/local mapping and partitioning, the operations described herein may be used with naïve mappings (hardware agnostic) to the target multicore architecture.

Different sets of global parameters and local parameters are stored in each core. Together the set of global parameters in all cores may be logically equivalent to a single set of global parameters but are instead arranged for local processing which avoids pin-limited, energetically expensive memory accesses from a large, shared memory. In one embodiment, the parameters are weight values for matrix-vector multiplications which may be used in conjunction with activation functions.

In one exemplary embodiment, the activation functions are hyperbolic tangent tanh (positive/negative reinforcement) and sigmoid (remember/forget) functions. Other examples of activation functions include, without limitation: identity, binary step, logistic (sigmoid or soft step variants), hyperbolic tangent and its variants, rectified linear unit (ReLU), Gaussian Error Linear Unit (GELU), Noisy ReLU, Leaky ReLU, Parametric ReLU, Exponential Linear Unit (ELU), Softmax, and/or any other activation function used in the neural processing arts.

As used herein, memory associated with a core may include memory resident on the core itself (e.g., registers, accumulators, etc. as shown in FIG. 2) and/or memory that is connected directly to the core (not via a shared bus). In an exemplary embodiment, this memory may include dedicated memory that is configured to store local neighborhood weights and each core's subset of the global network weights.

The terms “sparse,” “sparsity,” “sparsifying,” and “adding sparsity” refer to a dimensional distribution that skips elements of and/or adds null elements to a set. Skipping or adding null elements to a data structure may be achieved with any suitable activation function (e.g., a rectified linear unit (ReLU)). More generally, any activation function that inserts (or can be used to insert) null elements may be substituted to the same end. While the present disclosure is primarily directed to sparsity in spatial dimensions, artisans of ordinary skill in the related arts will readily appreciate that other schemes for adding sparsity (e.g., spatial, temporal, frequency, and other hybrids/variants thereof) may be substituted with equivalent success. A variety of other data structures may be used for representing sparse data structures, the aforementioned vectors and matrices being purely illustrative.

Generally, a combination of matrices can be used to emulate a neural network of nodes within a number of cores (C), where the dense local matrices are of dimension NPC×NPC (nodes per core (NPC)) and the sparse global interconnects are of dimension NPC×(C×NPC). Notably, the foregoing discussion is presented in the context of a two (2) tiered hierarchy (neighborhood, global); however, the techniques described herein may be extrapolated to any higher order degree with equal success. For example, a device (or multitude of devices) may support a four-tiered topology comprising: a “neighborhood” that is part of a “city” (with neighborhoods per city (NePCi)), which itself is part of a “state” (with cities per state (CiPSt)), within a “global” configuration. Such a configuration would include neighborhoods of NPC×NPC, city interconnect matrices of NPC×(NePCi×NPC), state interconnects of (NePCi×NPC)×(CiPSt×NePCi×NPC), etc. In other words, local unit outputs are sparse and broadcast to all other local units at each level of the hierarchy. Functionally, the hierarchy of memory-plus-processing enables each level to provide more connections with increasing sparsity to keep communication costs from ballooning. More directly, each additional layer of hierarchy enables a broader set of connectivity and opportunities for partitioning according to the available hardware platform considerations.

While illustrative embodiments of the present disclosure are described in the context of symmetric operation (e.g., each core of the multicore architecture is assigned the same number of nodes), other embodiments may asymmetrically assign nodes. For example, some devices may have a performance core which can support a greater number of logical nodes, and a power saving core that can support a fewer number of nodes at greatly reduced power consumption. Asymmetric node operation may result in different parameterizations; for example, four cores respectively supporting N₁, N₂, N₃, N₄ node networks would have N₁×N₁, N₂×N₂, N₃×N₃, and N₄×N₄ dense local matrices, and global interconnect matrices would be sized accordingly (e.g., N₁×(N₁+N₂+N₃+N₄), N₂×(N₁+N₂+N₃+N₄), etc.).
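The sizing rule for both the symmetric and asymmetric cases can be written as a small helper. This is a sketch with a hypothetical function name; it sizes only the activation-to-activation interconnect (one gate, no input-weight columns), which is an assumption made to match the dimensions quoted above.

```python
def matrix_shapes(nodes_per_core):
    """Return per-core (rows, cols) shapes for the dense local matrix and
    the sparse global interconnect, given each core's node count."""
    total_nodes = sum(nodes_per_core)
    return [
        {"local": (n, n), "global": (n, total_nodes)}
        for n in nodes_per_core
    ]

# Symmetric case: C cores of NPC nodes give NPC x NPC and NPC x (C x NPC).
symmetric = matrix_shapes([8] * 8)

# Asymmetric case: two larger cores and two smaller cores (assumed sizes).
asymmetric = matrix_shapes([8, 8, 4, 4])
```

For the asymmetric list above, the third core would hold a 4×4 local matrix and a 4×24 global interconnect slice.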

Furthermore, the principles described herein are not limited to embedded devices or even self-contained devices. The various concepts described herein may be extended to any neural networking application that benefits from localization of processing. As but one such example, several devices may be networked (via wired or wireless connectivity) to enable neural network sizes that greatly exceed the capabilities of any of the individual devices. In such a multi-device configuration, each device may have its own localized “neighborhood” and communicate with the global network of devices via sparse network communications. For example, a network of neighborhood devices may be in coordination with a city device, the city device may be part of a larger state network, etc.

While the following steps are described as occurring on a core of a multicore architecture, a plurality of cores may perform the steps of method 1100 in parallel on their respective sets of global and local parameters.

At step 1102 of the method 1100, a core of a multicore architecture may obtain an input vector. The input vector may be a sparse input vector (e.g., input vector x′_(t)). A core may receive the sparse vector from a broadcast to all cores. Exemplary embodiments are configured such that input vector x′_(t) (along with shared activation vectors (h′_(t))) may be the only core-to-core dependencies in the multicore architecture. More generally however, artisans of ordinary skill in the related arts, given the contents of the present disclosure will readily appreciate that less optimal transformations may allow (or require) cores to communicate and/or share other vectors. Such implementations may be preferable where hardware agnostic fitting is infeasible and/or unnecessary. While overall device performance may suffer, such performance reductions may be preferred in view of other holistic system constraints (e.g., convenience, breadth of deployment, versatility, code/network re-use, etc.)

While the foregoing discussion is presented in the context of a sparse vector, the concepts described herein may be broadly extended to any data structure, whether fixed or variable, sparse or dense. As a brief aside, the sparsity/density of a data structure may be calculated by dividing the number of non-null/non-empty elements by the total number of elements. For example, a sparse vector or matrix may have a sparsity of 0.1 (only 10% of the values are non-null) whereas a dense vector or matrix may have a density of 0.9 (90% of the values are non-null).
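As a minimal sketch of that calculation (the function name is hypothetical, and a zero value is assumed to stand for a null element):

```python
def non_null_fraction(values):
    """Fraction of elements that are non-null (non-zero) in a flat sequence."""
    if not values:
        raise ValueError("cannot compute the fraction of an empty structure")
    return sum(1 for v in values if v != 0) / len(values)

sparse_vector = [0, 0, 3, 0, 0, 0, 0, 0, 0, 0]
dense_vector = [1, 2, 3, 4, 0, 6, 7, 8, 9, 5]
```

Under the convention used above, `sparse_vector` has a fraction of 0.1 and `dense_vector` has a fraction of 0.9; a matrix can be handled the same way after flattening its rows.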

Conceptually, sparsity is a representation of an actual connectivity versus the potential connectivity of a neural network. A sparse neural network could represent many potential connections (of which only a few are actually connected). In contrast, a dense neural network can only represent a few potential connections (but most are actually connected). While the various examples are presented in the context of illustrative sparsity and density values (e.g., 0.1/0.9), the techniques described herein broadly apply to any such sparsity/density combination (e.g., 0.2/0.8, 0.3/0.7, 0.5/0.5, etc.). Further discussions of the benefits and tradeoffs associated with sparsity are described in greater detail hereinafter (see e.g., Operational Efficiency Tradeoffs, Sparsity and Density).

Additionally, while the foregoing scheme is presented in the context of broadcast signaling, other implementations may use e.g., one-to-one (unicast), one-to-many (multicast), many-to-one, and/or many-to-many, and/or any other communication variant. Such implementations may be used to e.g., prioritize connectivity and/or subdivide core operation. For example, a four (4) core device may operate with all four cores (a large network) or emulate two smaller networks; this may be useful for devices that dynamically switch between multiple applications.

At step 1104 of the method 1100, the core may perform global neural network processing. In some embodiments, a set of global parameters (a matrix) may be multiplied by an input vector. In some cases, the input vector may be sparsified; in alternative variants, the input vector may be processed without sparsification.

Within the context of the present disclosure, the terms “global”, “globalization”, “globalized” and other linguistic variants thereof, refer to processing, signaling, and/or other associated resources (signals, logic, and memory) that may be propagated to, received from, shared with, or otherwise affect all other cores of a plurality of cores. As but one specific example, global signaling may be broadcast to all cores of the multicore architecture.

In an exemplary embodiment, the input vector may be combined (concatenated) with or include a global activation vector. In this exemplary embodiment, the set of global parameters is multiplied by the combination of the input vector and the global activation vector. The core may assemble the global activation vector from the portions that each of the plurality of cores broadcast following its previous operation. The resulting matrix of values may be used as inputs to the one or more activation functions in the core.

At step 1106 of the method 1100, the core may perform local neural network processing. In some embodiments, a set of local parameters may be multiplied by a local vector that is specific to each core. In one exemplary implementation, the local vector may be dense. The dense vector that is specific to each core may include the previous activation vector. During operation, the local neural network processing may include multiplying the set of local parameters with the dense vector.

Within the context of the present disclosure, the terms “local”, “localization”, “localized” and other linguistic variants thereof, refer to processing, signaling, and/or other associated resources (signals, logic, and memory) that are not propagated to, received from, shared with, or otherwise affect other cores of a plurality of cores. For example, local parameters may be exclusive to a single core and stored spatially near the corresponding core. While the illustrated examples are presented in the context of localization to a single core, other variants may e.g., localize processing to multiple cores (e.g., in a four-core architecture, processing may be localized to a pair of cores, etc.)

In an exemplary embodiment, each physical core of the multicore architecture is assigned to a logical neighborhood or cluster of neurons (e.g., 2, 4, 8, 16, 32, etc. neurons). Each neighborhood or cluster of neurons shares a common memory which includes the set of global parameters and the set of local parameters. In a variant, multiple neighborhoods or clusters of neurons share a single core. This allows for low cost (e.g., energy, bus bandwidth) dense communications between neurons within a neighborhood while maintaining the benefits of global parameter sparsity.

The resulting matrix of values from the multiplication of the set of local parameters and the dense vector may be used as e.g., inputs to the one or more activation functions in the core. For example, the resulting matrix may be split into three different components to provide inputs into two sigmoid activation functions and one hyperbolic tangent activation function. Still other combinations of activation functions may be substituted by artisans of ordinary skill in the related arts, given the contents of the present disclosure.

At step 1108 of the method 1100, the core may generate a result vector for distribution to the plurality of cores. In one embodiment, the result vector is based on a dense vector generated by a modified gated recurrent unit (GRU) that combines a dense activation vector from previous neighborhood network activity with sparsified input and a sparse activation vector from previous global network activity. The various combinations of dense neighborhood/sparse global are remembered-forgotten (sigmoid) and positively or negatively reinforced (tanh); for example, a first component (r_(t)) may be generated by remembering/forgetting a weighted sum of the sparsified input (x′_(t)), the sparse activation vector from previous global network activity (h′_(t−1)), and a dense activation vector from previous neighborhood network activity (h_(t−1)) (as described above in EQNS. 11 and 16).

In one exemplary embodiment, the result of the modified GRU may be locally re-circulated in its dense form, and sparsified for global distribution to other cores of the multicore architecture. Sparsification may include e.g., skipping elements or adding null elements to the dense vector. In an exemplary embodiment, a rectified linear unit (ReLU) activation function is applied to the resulting dense vector to create a portion of the globally sparse activation vector.
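One way to picture the sparsification is to apply ReLU and then keep only the non-null entries for transmission. The sketch below is an assumption-laden illustration: the (index, value) pair encoding and the function names are hypothetical and are not a format specified by the present disclosure.

```python
def relu(v):
    """Rectified linear unit: negative elements become null (0.0)."""
    return [max(0.0, a) for a in v]

def compress(v):
    """Keep only non-null elements as (index, value) pairs."""
    return [(i, a) for i, a in enumerate(v) if a != 0.0]

def decompress(pairs, length):
    """Rebuild the full-length vector from (index, value) pairs."""
    out = [0.0] * length
    for i, a in pairs:
        out[i] = a
    return out

dense_h = [0.5, -1.0, 0.0, 2.0]   # retained locally in dense form
sparse_h = relu(dense_h)          # this core's portion of h'
message = compress(sparse_h)      # what would be sent to other cores
```

Here the dense vector keeps all four values for local re-circulation, whereas the message carries only the two non-null entries.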

In one embodiment, the globally sparse activation vector is an asynchronous combination of sparsified results from the network's constituent cores. In other words, each core updates its portion of the globally sparse activation vector without regard to the timing of the other cores (updates to the globally sparse activation vector are not synchronized). The core may broadcast its core specific portion of the globally sparse activation vector to each core of the multicore architecture. The broadcast core specific portion of the activation may go into a messaging queue or buffer that is specific to each core for use. Each core's specific portion of the activation vector may then be retrieved from the queue or buffer and used to assemble the next activation vector. In one such implementation, each core is assigned a core identifier and transmits the core identifier with the core specific portion of the activation vector. The next activation vector may be assembled in core identifier order. In a similar embodiment, the core specific portion of the activation vector may be sent by the core to a shared messaging queue or buffer associated with the multicore architecture where it is combined with the core specific portion of the activation from other cores in the multicore architecture. A separate core, a scheduler/controller, or other processing logic assembles the activation vector and broadcasts the assembled activation vector to each core. Still other signaling schemes may be substituted with equal success, by artisans of ordinary skill in the related arts, given the contents of the present disclosure. For example, some variants may use synchronous updates, directed signaling, and/or local or global queuing mechanisms.
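The per-core queue variant can be sketched as follows. The message format (a core identifier followed by (index, value) pairs), the function name, and the choice to retain a core's previous portion when no new message has arrived are all assumptions made for illustration; the last of these is merely one possible reading of asynchronous updating.

```python
from collections import deque

def assemble_global_activation(queue, npc, previous):
    """Drain a core's message queue and assemble the next global activation
    vector in core identifier order.

    queue    -- deque of (core_id, [(local_index, value), ...]) messages
    npc      -- nodes per core (symmetric partitioning assumed)
    previous -- the last assembled vector; portions without a new message
                are carried over unchanged (assumption)
    """
    latest = {}
    while queue:
        core_id, pairs = queue.popleft()
        latest[core_id] = pairs            # a newer message replaces an older one
    out = list(previous)
    for core_id in sorted(latest):         # core identifier order
        base = core_id * npc
        out[base:base + npc] = [0.0] * npc  # unspecified elements are null
        for index, value in latest[core_id]:
            out[base + index] = value
    return out
```

For example, with three cores of two nodes each, messages from cores 2 and 0 would update the last and first pairs of elements respectively, leaving core 1's portion as it was.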

Once completed, the core may begin work on the next available task which may include obtaining a sparse vector (e.g., returning to step 1102 of the method 1100).

More generally, the various techniques described herein may be broadly applied to any recurrent neural network (RNN) that can be mathematically transformed and apportioned into localized processing (to various degrees) and globalized processing. Localized processing may be compressed into densely connected networks to optimally utilize local core resources. Globalized processing may be sparsified and logically distributed to the other cores of the network. Other RNNs that may benefit from the various concepts described herein include, without limitation, traditional RNNs, LSTM-based RNNs (and their variants), and GRU-based RNNs (and their variants).

Referring now to method 1200 of FIG. 12, a logical flow diagram of an exemplary method 1200 for performing a modified gated recurrent unit (GRU) within a core of a multicore neural network architecture is shown. GRUs are characterized by “update” information (a combination of remembered/forgotten node state and input), “candidate” information (positive/negative reinforcement via e.g., a tanh), and the “cell” or “node” state. While the following discussion is performed within non-transitory computer-readable media (software), hardware and/or firmware implementations of the modified-GRU process may be substituted with equal success, by artisans of ordinary skill (given the contents of the present disclosure).

At step 1202 of the method 1200, a core of a multicore architecture receives an input vector (e.g., x′ of FIG. 5) and a global activation vector (e.g., h′_(t−1)). In an exemplary embodiment, the core receives the input vector (e.g., x′) from external stimulus to the multicore architecture. External stimulus may include audio, video, and/or other sensed metrics. For example, a hearing aid and/or earbuds may include a microphone that captures acoustic data as input. Similarly, a mobile device may include a camera that provides visual data as input. Other sensors may include e.g., acoustic, sound, vibration, electromagnetic, chemical, temporal, spatial, positioning, acceleration, etc. In another example, external stimulus may include streaming text data for NLP processing (e.g., language translation, language understanding) or video data and may be performed on a mobile phone, an augmented reality/virtual reality (AR/VR) goggle/headset, a wearable (e.g., smart watch), a laptop, etc.

In one exemplary embodiment, the global activation vector (e.g., h′_(t−1)), which represents the previous state of the network nodes, is received as an asynchronous broadcast communication from a plurality of cores of the multicore architecture. As previously noted, sparsity in the global activation vector (as well as the input vector) allows the data to be compressed as it is transferred between cores as well as allowing a reduction in parameters and access counts.

In one exemplary embodiment, core-specific parameters are retrieved from local memories. For example, a local activation vector (e.g., h_(t−1)), a portion of a global weight matrix (e.g., W_(ir), W_(iz), W_(in) and W_(hr), W_(hz), W_(hn)), and their corresponding local matrices (e.g., D_(hr), D_(hz), D_(hn)) can be retrieved from a memory associated with the core. The memory may be associated with the core alone (not any other core of the multicore architecture) and spatially localized thereto; dedicated access and on-chip proximity greatly improve memory bandwidth.

At step 1204 of the method 1200, the core of the multicore architecture calculates an update to its neighborhood state e.g., in accordance with the modified-GRU process described above (see e.g., EQNS. 8-15, also restated as EQNS. 16-20). For example, to calculate update information (described in EQN. 9), the core of the multicore architecture concatenates the input vector (x′) and the global activation vector (h′_(t−1)), forming a concatenated sparse vector (xh). The core then may multiply its portions of the global weight matrix (W_(ir), W_(iz), W_(in) and W_(hr), W_(hz), W_(hn)) with the concatenated sparse vector (xh), creating a first set of global update values (a, b, c). Similarly, candidate information may be calculated according to EQN. 10 and the resulting node state may be calculated according to EQN. 14, etc.

At step 1206 of the method 1200, the core of the multicore architecture provides a sparsified neighborhood state to other cores. In an exemplary embodiment, calculating the localized portion of the updated global activation vector includes applying a rectified linear unit (ReLU) activation function to the core's neighborhood state. The localized portion may then be combined with localized portions of other cores to generate an updated global activation vector for use in the next iteration.

To further illustrate how the above implementations for operating a neural network architecture can be performed, illustrative pseudocode is provided below for a single core operation and multicore operation. The pseudocode is provided for illustrative purposes, and other code can be used to implement the algorithms described above as would be understood by one of ordinary skill, given the contents of the present disclosure.

FIG. 13 shows a segment of pseudocode 1300 for operating a neural network on a single core. Pseudocode segment 1302 initializes the dense local (neighborhood) activation vector (h) and the sparse global activation vector (h′), each of length N. The pseudocode segment 1302 also initializes the matrices W and D which represent the sparse global weight matrix and the dense local matrix respectively.

A sparse input vector is received at pseudocode segment 1306 and the sparse input vector is concatenated with the sparse global activation vector at pseudocode segment 1308. Pseudocode segment 1310 performs a sparse matrix-vector multiplication of the global weight matrix (W) and the concatenated vector (xh), splitting the resulting vector into three portions a, b, and c. Pseudocode segment 1312 performs a matrix-vector multiplication with the dense local matrix (D) and dense local activation vector (h), splitting the resulting vector into three portions e, f, and g. The resulting operations use cluster size (C) times N (vector length) memory lookups and MACs.

Pseudocode segment 1314 uses portions of the results of the multiplications of pseudocode segments 1310 and 1312 to perform remember-forget and reinforcement operations which are used to update the local activation vector (h) at pseudocode segment 1316. At pseudocode segment 1318, the global activation vector is updated as a sparsified snapshot of the local activation vector (h).

At pseudocode segment 1320, operations loop back to pseudocode segment 1304 to receive the next input.

FIG. 14 shows a segment of pseudocode 1400 for operating a single core of a multicore neural network. Multiple nodes may be emulated on each core of the multicore neural network. Each core then may act as a neighborhood with dense data and communication that does not need to get distributed throughout the broader global multicore neural network, thus creating a physical and logical hierarchy for nodes in the neural network. Embodiments of the disclosed system can exploit this hierarchy and create more efficient ways to store data (closer to where it is needed), transmit data across the network (sending sparse data over dense data), consume less power (as data is physically closer to where it is used), and offer greater performance.

Pseudocode segment 1402 initializes the dense local (neighborhood) activation vector (h), of length NPC (nodes per core), and the sparse global activation vector (h′), of length N. The pseudocode segment 1402 also initializes the matrices W and D which represent the sparse global weight matrix and the dense local matrix respectively. Constants CORES (the number of cores in the multicore architecture), CIDX (the identifier of the present core), and NPC (the number of nodes per core) are defined in pseudocode segment 1402.

A sparse input vector is received at pseudocode segment 1406, and each core broadcasts its piece of the global activation vector (h′) to each other core in the multicore architecture. Similarly, the core receives pieces of the global activation vector from other cores. The broadcasted pieces are used to (re-)create the full global activation vector based on core identifiers (segment 1408).

In the illustrated pseudocode segment 1400, each core is assumed to have the same number of nodes per core. Other embodiments may vary the number of nodes per core; asymmetric variants may be useful where data dependencies differ across cores. In such embodiments, additional information may need to be used to (re-)create the full global activation vector (e.g., the number of nodes for the core, etc.).
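The reassembly of the global activation vector from per-core pieces may be sketched as follows. This is a minimal illustration only, not the pseudocode of FIG. 14; the function name `assemble_global` and the `nodes_per_core` list are hypothetical, and the layout (each core's slice placed at the running sum of the preceding cores' node counts) is an assumption.

```python
def assemble_global(pieces, nodes_per_core):
    """Rebuild the full global activation vector from per-core pieces.

    pieces: dict mapping a core identifier to that core's list of activations.
    nodes_per_core: list whose entry c is the number of nodes on core c
                    (all entries are equal in the symmetric case).
    """
    # Assumed layout: core c starts at the sum of the sizes of cores 0..c-1.
    offsets = [0]
    for n in nodes_per_core:
        offsets.append(offsets[-1] + n)

    h_global = [0.0] * offsets[-1]
    for cidx, piece in pieces.items():
        if len(piece) != nodes_per_core[cidx]:
            raise ValueError("piece length does not match nodes on core %d" % cidx)
        start = offsets[cidx]
        h_global[start:start + len(piece)] = piece
    return h_global


# Symmetric case: two cores, two nodes each; pieces may arrive in any order.
symmetric = assemble_global({1: [0.0, 3.0], 0: [1.0, 0.0]}, [2, 2])

# Asymmetric case: the per-core node counts are the "additional information".
asymmetric = assemble_global({0: [5.0], 1: [0.0, 0.0, 7.0]}, [1, 3])
```

In the symmetric case the offsets reduce to the core identifier times the nodes per core, so no additional information beyond the core identifier is needed.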

In pseudocode segment 1410, the sparse input vector is concatenated with the sparse global activation vector for use in matrix-vector multiplication. Thereafter, pseudocode segment 1412 performs a sparse matrix-vector multiplication of the global weight matrix (W) and the concatenated vector (xh), splitting the resulting vector into three portions (a, b, and c). Pseudocode segment 1414 performs a matrix-vector multiplication with the dense local matrix (D) and dense local activation vector (h), splitting the result into three portions (e, f, and g).

Pseudocode segment 1416 uses portions of the results of the multiplications of pseudocode segments 1412 and 1414 to perform remember-forget and reinforcement operations which are used to update the local activation vector (h) at pseudocode segment 1418. At pseudocode segment 1420, the core-specific portion of the global activation vector (h′) is updated as a sparsified snapshot of the local activation vector (h).

At pseudocode segment 1422, the core specific portion of the global activation vector is broadcasted to the other cores in the multicore architecture. At pseudocode segment 1424, operations loop back to pseudocode segment 1404 to receive the next input.

Operational Efficiency Tradeoffs, Sparsity and Density

The exemplary embodiments described herein provide a plethora of advantages that improve the functioning of neural networks in computer processes. Notably, the exemplary dense local, sparse global processing described herein provides unconventional technical solutions for hardware-aware neural network processing. The sparsity (or density) of a data structure may be calculated by dividing the number of null/empty (or non-null/non-empty) elements by the total number of elements. Sparsity and density may be terms of absolute or relative degree. For example, a data structure may be considered sparse if most of its values (greater than 50%) are null values; similarly, a first data structure may be more sparse than a second data structure even where both data structures are dense (i.e., mostly non-null). While any data structure may be considered relatively sparse or dense, there may be propagation/storage efficiencies as the data becomes sparser and/or computational efficiencies to packing data more densely. The following discussion characterizes various operational tradeoffs that may be made by changing sparsity/density.

Sparse global connection matrices have approximately αβN² parameter lookups, where α is the activation density and β is the parameter density (i.e., the non-null fractions of the activations and parameters, respectively). Dense local parameter matrix lookups are characterized by CN. Thus, the parameter lookups for 6 global weight matrices (e.g., W_(ir), W_(iz), W_(in) and W_(hr), W_(hz), W_(hn)) and their corresponding 3 local matrices (e.g., D_(hr), D_(hz), D_(hn)) are given by 6×αβN²+3CN. Consider a neural network of 1024 nodes (N=1024) that is grouped into dense neighborhood clusters of 32 nodes apiece (C=32), having sparse global interconnections characterized by α,β=0.1. Such a network would require approximately 160,000 parameter lookups, which compares favorably (approximately 40× fewer) to systems that rely on 6 brute force O(N²) parameter lookups (~6,000,000 parameter lookups for an equivalent system).

Notably, parameter lookups also differ based on usages. In the foregoing example, there are 6 sparse global matrices having a parameter density of β=0.1; or 6×1024²×0.1≈600×10³ parameters. In contrast, dense neighborhood parameters are localized to 3 memories with an assumed density of 1.0. Thus, dense parameters consume 3×32×1024≈100×10³ entries. In other words, even though dense lookups dominate the lookup count, there are only about 1/6th as many dense parameters. As a practical matter, the difference in utilization between global and neighborhood parameters may be leveraged in a variety of different ways. For example, different implementations may seek to further increase the disparity of global/local parameterization to further improve performance. Alternatively, global/local parameterization may be load balanced to reduce device wear, etc.
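The arithmetic of the foregoing example may be checked with a short calculation. The sketch below simply evaluates the expressions stated above for N=1024, C=32, and α,β=0.1; the variable names are illustrative, and treating α and β as the non-null fractions of activations and parameters is an interpretation of the text.

```python
N = 1024      # nodes in the network
C = 32        # nodes per dense neighborhood cluster
alpha = 0.1   # activation factor (interpreted as non-null fraction)
beta = 0.1    # parameter factor (interpreted as non-null fraction)

# Lookups per pass: 6 sparse global matrices plus 3 dense local matrices.
sparse_lookups = 6 * alpha * beta * N ** 2   # about 63 x 10^3
dense_lookups = 3 * C * N                    # about 98 x 10^3
total_lookups = sparse_lookups + dense_lookups

# Brute-force baseline: 6 fully dense N x N matrices.
brute_force_lookups = 6 * N ** 2
lookup_ratio = brute_force_lookups / total_lookups

# Stored parameters: sparse storage scales with parameter density only.
sparse_params = 6 * beta * N ** 2            # about 600 x 10^3
dense_params = 3 * C * N                     # about 100 x 10^3
storage_ratio = dense_params / sparse_params
```

Under these assumptions the dense neighborhood matrices account for roughly 60% of the lookups while holding roughly one sixth as many stored parameters as the sparse global matrices.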

More generally, the various principles described herein address specific memory access, processing bandwidth, and/or utilization issues that are specific to hardware-aware neural networks; these are unique and distinct from well-understood, routine, and/or conventional solutions implemented within traditional neural network computing.

Exemplary Hardware-Aware Mapping and Partitioning

FIG. 15 is a logical flow diagram of an exemplary method 1500 for optimizing standard machine learning frameworks for multicore architectures, in accordance with the various principles described herein. While the following discussion is presented in the context of the exemplary multicore architecture 200 described above, the hardware-aware mapping and partitioning techniques described herein may improve the performance of any multicore neural network implementation. For example, even naïve implementations of multicore networks via the general compute system 100 of FIG. 1 would benefit from mapping and/or partitioning portions of the neural network processing to specific cores 112A, 112B . . . 112N so as to reduce in-core parameter storage and activations, and off-core data transfers.

At step 1502 of method 1500, a logical model of a neural network is synthesized to device-specific primitives. As a brief aside, existing neural networks may be designed in a variety of design languages. For example, the most common machine learning frameworks (e.g., PyTorch, TensorFlow, etc.) use graphical representations of “machine models” to describe how nodes of the network are connected to one another, etc. Machine learning frameworks may include software libraries/application programming interfaces (APIs) for machine learning to perform training and/or statistical inference of (deep) neural networks. These frameworks may offer building blocks for designing, training, and validating deep neural networks through a high-level programming interface for a user.

As used herein, the term “logical model” (also referred to as a learning model or a prediction model) refers to any schema for representing the structure of a neural network. Structural descriptions of a neural network may specify individual node functionality, node connectivity, and/or groups of nodes (e.g., layers), to take desired inputs and generate desired outputs. For example, a logical model may use training data to derive the desired outputs from weighted combinations of input variables. Ideally, the logical model generates a neural network mapping that works for the training data, as well as similar real-world data.

In one exemplary embodiment, the logical model is a graphical representation of computation that includes a flow/directed graph where each node of the graph represents one or more atomic operations, and where each node may be annotated with a node name, a node type, operations/actions, data, points where data flows into or out of the chip, and input/output or communication nodes.

As used herein, the term “primitive” refers to any indivisible unit of operation/functionality that is specific to a device. An indivisible unit of operation cannot be further subdivided, e.g., a software primitive may be an opcode, a hardware primitive may be a combinatorial or sequential logic, etc. For example, a specific field programmable gate array (FPGA) would support a specific instruction set, and look-up-table logic, etc.

In one exemplary embodiment, design synthesis is further staged into three sub-steps: atomization (sub-step 1504), quantization (sub-step 1506), and tracing (sub-step 1508). At sub-step 1504, the model may be atomized into a set of fundamental operations, referred to as “atomics” or “atomic operators”. Atomics are related to, but abstracted from, primitives. Atomics include the set of linear algebra and pointwise operations that are common to all machine learning operations and may be agnostic to software and/or hardware details. During this stage, simple layers of the model (e.g., dense feedforward layers) may directly map to a single atomic operator. Higher-level layers may be decomposed into more basic operations before mapping to atomic operations. In an embodiment, a dictionary or look-up table (LUT) is used to map operators from the source machine learning framework to the set of atomics.

In some embodiments, the model may additionally be converted to an intermediate representation (e.g., a Patch Intermediate Representation (PatchIR)) for quantization aware training. The intermediate representation may enable users to continue training models after atomization and quantization, while still remaining compatible with the source machine learning framework. Each dialect of the intermediate representation may include a computation graph parser to interpret the input model from the source framework, a library of quantization-aware-training friendly implementations of each of the atomic operators written in the source framework, and a dictionary mapping machine-learning layers from the source framework to atomic operators or compositions of atomic operators.

Notably, design synthesis may have many ways to map logical functionality to device primitives, e.g., an adder could be implemented within many different software primitives and/or hardware primitives. This can be further complicated where exact functionality is not required (e.g., where device operation can acceptably deviate from the idealized neural network model). To reduce synthesis complexity and/or improve synthesis results, additional machine learning layers can be added to, or substituted for, existing layers of the logical model. The additional layers may reduce constraints on synthesis/fitting; for example, sparsifier layers introduce activation sparsity that can be tagged as prune-able activations; prune-able activations can improve regularization (described in greater detail below). Similarly, sparsifiable recurrent neural network layers (such as the modified GRU described in FIG. 5, above) can be substituted for generic recurrent neural network layers (e.g., the GRU of FIG. 4, above). Additionally, certain functionality (such as spectral transformation layers, encryption/decryption engines, etc.) may be more efficiently performed in specialized logic; as but one such example, dedicated logic for Short-Time Fourier Transform (STFT) may be used to convert raw waveform audio into the time-frequency domain as inputs to the neural network.

Typically, logical machine-learning models use floating-point representations for parameters and activations; since most embedded devices are implemented with fixed-point data structures, differences in behavior due to floating/fixed-point conversion should be resolved and/or trained to compensate. Conceptually, logical models should be significantly compressed by quantizing variables to use fewer bits without suffering significant losses in accuracy. Empirical results suggest that quantization may provide similar functionality at a fraction of the logical model's memory footprint (a factor of 8× reduction), simplify the processing logic, and reduce the latency and energy costs of operation.

At sub-step 1506, high-precision floating-point operations are quantized and approximated with lower-precision integers. In one embodiment, quantization may convert floating-point representations (32-bit or 64-bit) to integer representations (INT16, INT8, INT4, etc.). In one specific variant, the quantization may be parameterized into bit-depth and shift-amount for each atomic operator.

In one exemplary embodiment, the quantization may be iteratively optimized by processing a set of representative inputs with the model, and adjusting the quantization based on collected statistics from the traced output (discussed below, at sub-step 1508). The mapping between a floating-point number (x) and its integer representation (for a given quanta (q) and bit width (b)) may be given by:

$\begin{matrix} {{Q\left( {x,b,q} \right)} = {{clamp}\left( {{{round}\left( \frac{x}{2^{q}} \right)},{- 2^{b - 1}},{2^{b - 1} - 1}} \right)}} & {{EQN}\mspace{14mu} 21} \end{matrix}$

For known floating-point ranges, q and b may be chosen to minimize the Quantization Mean Square Error (QMSE) represented in EQN 22.

$\begin{matrix} {{QMSE} = {\frac{1}{N}{\sum_{i = 1}^{N}\left( {x_{i} - {2^{q}{Q\left( {x_{i},b,q} \right)}}} \right)^{2}}}} & {{EQN}\mspace{14mu} 22} \end{matrix}$
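EQNS. 21 and 22 may be illustrated with the following minimal sketch. The function names and the candidate range for q are hypothetical; Python's built-in `round` (which rounds ties to even) is used as one possible rounding mode, since the rounding mode is not specified above.

```python
def quantize(x, b, q):
    """EQN 21: clamp(round(x / 2^q), -2^(b-1), 2^(b-1) - 1)."""
    lo = -(2 ** (b - 1))
    hi = 2 ** (b - 1) - 1
    return max(lo, min(hi, round(x / 2.0 ** q)))


def qmse(xs, b, q):
    """EQN 22: mean squared error between x and its de-quantized value."""
    return sum((x - 2.0 ** q * quantize(x, b, q)) ** 2 for x in xs) / len(xs)


def best_q(xs, b, q_candidates):
    """Pick the candidate q with the lowest QMSE (first one wins on ties)."""
    return min(q_candidates, key=lambda q: qmse(xs, b, q))


samples = [100.0, -50.0, 25.0]
chosen_q = best_q(samples, b=8, q_candidates=range(-4, 4))
```

For these samples, smaller q values clip the largest value at the 8-bit limit and larger q values lose the low-order bits, so the search settles on the q that represents every sample exactly.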

For unknown floating-point ranges, a statistical approximation may be used to generate a range of q values to estimate an optimum mapping. As but one such example, the GQMSE (Gaussian QMSE) may be used where x is normally distributed (x˜N(μ, σ)); GQMSE may be calculated by EQN 23.

$\begin{matrix} {GQMSE = \frac{1}{\sqrt{2\pi\sigma^{2}}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \left(x - 2^{q}Q\left(x,b,q\right)\right)^{2}\,dx} & {{EQN}\mspace{14mu} 23} \end{matrix}$
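The GQMSE may be estimated numerically. The following sketch integrates the squared quantization error against a standard Gaussian density (mean μ, standard deviation σ) using the trapezoid rule; the integration span, step count, and function names are assumptions chosen for illustration.

```python
import math


def quantize(x, b, q):
    """EQN 21: clamp(round(x / 2^q), -2^(b-1), 2^(b-1) - 1)."""
    lo = -(2 ** (b - 1))
    hi = 2 ** (b - 1) - 1
    return max(lo, min(hi, round(x / 2.0 ** q)))


def gqmse(mu, sigma, b, q, span=8.0, steps=100_000):
    """Trapezoid-rule estimate of the Gaussian-weighted quantization error.

    Integrates over [mu - span*sigma, mu + span*sigma]; the tails beyond
    that range are assumed to be negligible.
    """
    lo = mu - span * sigma
    dx = 2.0 * span * sigma / steps
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        err = x - 2.0 ** q * quantize(x, b, q)
        pdf = norm * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoid end points
        total += weight * pdf * err * err
    return total * dx


fine = gqmse(mu=0.0, sigma=1.0, b=8, q=-4)     # step size 1/16
coarse = gqmse(mu=0.0, sigma=1.0, b=8, q=-2)   # step size 1/4
```

When clipping is negligible, the estimate approaches the familiar uniform-noise value of step²/12, which provides a quick sanity check on a chosen q.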

In some embodiments, quantization sub-step 1506 may include granular quantization control that enables different precisions for different layers of a model (“heterogeneous precision”). In such cases, a user may tag different layers in their model with different precisions. For example, spectro-temporal input data could flow through a model along two paths to implement selective noise reduction. One path's output is a time-frequency mask that indicates which time-frequency bins are noise and which are signal. This mask is applied to the other path containing the original input time-frequency data. While the mask and other data can be computed at 8 or 4 bits (“standard”), the input data may be best preserved at 16 bits (“double”). Various implementations may strictly (or loosely) obey user tagged precisions during quantization. Untagged layers may be assigned a default precision by the mapping algorithm.

Similarly, certain embodiments may include different options for vector and matrix precision at different layers. For example, “standard” precision may be 8-bit integers for vectors and 4-bit integers for matrices. “Eights” precision may be 8-bit integer precision for both vector and matrix values. “Double” precision may be 16-bit integers for vector and 8-bit integers for matrices.

While the foregoing discussion is presented in the context of specific data structures, artisans of ordinary skill in the related arts will readily appreciate that virtually any data structure of any dimensionality may be substituted with equal success. Examples of such data structures may include signed/unsigned integers, floating-point of any precision, and/or any other data representation.

As shown in FIG. 16, circles within the hierarchy of layers 1600 depict different layers in the machine learning model. In hierarchy of layers 1600, a heterogeneous precision is applied to a model by tagging layers with different precisions. Layer 1602 is tagged with double precision, while layer 1604 is tagged with standard precision. Arrows depict parent-child relationships in the layer hierarchy. During automatic quantization, precisions are chosen for untagged layers based on the layer hierarchy (i.e., untagged layers inherit their precision from their parent in the hierarchy). Layers that do not inherit a precision from a parent layer may be set to the default precision (standard).
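The inheritance rule described above may be sketched as a walk up the layer hierarchy. The layer names, the `parents` and `tags` dictionaries, and the function name are hypothetical and do not correspond to the reference numerals of FIG. 16.

```python
DEFAULT_PRECISION = "standard"


def resolve_precisions(parents, tags):
    """Assign a precision to every layer.

    parents: dict mapping each layer to its parent layer (None for a root).
    tags: dict mapping user-tagged layers to their precision.
    Untagged layers inherit from the nearest tagged ancestor; layers with
    no tagged ancestor receive the default precision.
    """
    resolved = {}

    def lookup(layer):
        if layer in resolved:
            return resolved[layer]
        if layer in tags:
            precision = tags[layer]
        elif parents.get(layer) is not None:
            precision = lookup(parents[layer])
        else:
            precision = DEFAULT_PRECISION
        resolved[layer] = precision
        return precision

    for layer in parents:
        lookup(layer)
    return resolved


parents = {
    "model": None,
    "denoise": "model",
    "stft": "denoise",
    "mask": "denoise",
    "conv": "mask",
}
tags = {"denoise": "double", "mask": "standard"}
precisions = resolve_precisions(parents, tags)
```

In this example the untagged root falls back to the default, one untagged child inherits "double" from its tagged parent, and the other inherits "standard" from a more specific tag lower in the hierarchy.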

Referring back to sub-step 1506 of FIG. 15, certain aspects of neural network operation may be analyzed and/or trained on. In one embodiment, commonly used functions (sigmoid, tanh, sqrt, log, reciprocal, etc.) may be stored in look-up-tables (LUTs) because it may be more efficient to read a value out of a table than to evaluate the function on-the-fly (e.g. via Taylor Series).

Unfortunately, in some situations, the limited input address space of the LUT operation can bottleneck performance at higher precision operations. For example, an 8-bit addressable LUT may only store 256 outputs for 256 linearly spaced inputs, however the activation precision level may be configured for either 8-bit or 16-bit output values. As a result, during “double” precision, INT16 activations may be compressed to INT8 before they can address the LUT. This compression may introduce quantization error (e.g., Quantization Mean Squared Error) during operation.

Some embodiments may use linear interpolation (piecewise linear approximation) to mitigate quantization errors. Linear interpolation calculates a local slope f′({tilde over (x)}) between neighboring entries in the LUT to approximate f(x). The interpolated approximation to f(x) is given by EQN. 24, where x is the input at the output precision (e.g. 16-bit) and {tilde over (x)} is the input at the compressed input precision (e.g. 8-bit).

f(x)≈f({tilde over (x)})+(x−{tilde over (x)})f′({tilde over (x)})   EQN. 24:
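EQN. 24 may be illustrated with a small floating-point sketch. The table range, the use of tanh, the floor-based compression of x to its LUT entry, and all function names are assumptions made for illustration; a fixed-point device would index the table with integer activations instead.

```python
import math


def build_lut(f, lo, hi, entries=256):
    """Tabulate f at `entries` linearly spaced inputs starting at lo."""
    step = (hi - lo) / entries
    return [f(lo + i * step) for i in range(entries)], step


def lut_index(lut, step, lo, x):
    """Compress x to the index of the nearest LUT entry at or below it."""
    return min(len(lut) - 1, max(0, int((x - lo) // step)))


def lut_lookup(lut, step, lo, x):
    """Plain lookup: return f at the compressed input."""
    return lut[lut_index(lut, step, lo, x)]


def lut_interpolated(lut, step, lo, x):
    """EQN 24: f(x_tilde) + (x - x_tilde) * slope between neighbors."""
    i = lut_index(lut, step, lo, x)
    x_tilde = lo + i * step
    if i + 1 < len(lut):
        slope = (lut[i + 1] - lut[i]) / step
    else:
        slope = (lut[i] - lut[i - 1]) / step   # last entry: reuse prior slope
    return lut[i] + (x - x_tilde) * slope


LO, HI = -4.0, 4.0
table, step = build_lut(math.tanh, LO, HI)
x = 0.51
plain_error = abs(lut_lookup(table, step, LO, x) - math.tanh(x))
interp_error = abs(lut_interpolated(table, step, LO, x) - math.tanh(x))
```

The interpolated lookup costs one extra table read and one multiply-accumulate, and in this example it reduces the approximation error by well over an order of magnitude.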

Other embodiments may use telescoping functions that change sensitivity over different ranges. A telescoping approximation compresses the input into several levels. For instance, a two-level telescoping approximation combines two evaluations of a function—one for the coarsely compressed input and one for the finely compressed input. The coarsely compressed input preserves the original dynamic range of the input but has a large step size. The finely compressed input has a small step size which preserves the original input's granularity but may only be valid for inputs with small magnitudes. Telescoping functions are often suitable for functions that satisfy the additive identity of EQN. 25 (e.g., logarithms) or the multiplicative identity of EQN. 26 (e.g., monomials of the form f(x)=αx^(p)):

f(k*x)=f(k)+f(x)   EQN. 25:

f(k*x)=f(k)*f(x)   EQN. 26:

Referring back to sub-step 1508 of FIG. 15, the effects of quantization are traced to aid subsequent iteration and/or training. In one exemplary embodiment, each atomic operator (from sub-step 1504) is annotated with the quantization parameters (from sub-step 1506) to determine the resulting data flow. The data flow between atomic operators can be represented as a set of directed edges between nodes in a node graph. Unconstrained portions of the node graph may receive default configurations and/or derive their configurations from constrained portions. For example, a node may assume that its input is the same bit-width as its data source (e.g., an upstream node). The data flow may be further annotated with attributes that aid the compiler, such as vector shapes and sizes, activation sparsity, parameter sparsity, and/or data types.

As previously noted, logical models of neural networks are trained to generate output data, based on training data sets. However, various embodiments of the present disclosure may incorporate device-awareness into the training process (step 1510). While the following discussion is presented in the context of hardware-aware training, artisans of ordinary skill in the related arts will readily appreciate that the concepts may be broadly extended to any awareness-based training. Within the context of neural network training, the terms “aware”, “awareness”, and its linguistic derivatives refer to training techniques that compensate, leverage, or otherwise adjust for, parameterized capabilities, limitations, and/or functionalities. For instance, hardware-aware training may be based on limitations (or capabilities) of the processor, memory, and/or logic gates. Examples of such limitations (or capabilities) may include processing speed, memory size, gate numerosity, etc. Software-aware training may be based on limitations (or capabilities) of the software execution; examples of such parameters may include data structure sizes, permissions, memory allocations, locking access, etc. The concepts described herein may be broadly applied to any resource that may affect real-world operation; for example, training may be modified based on e.g., power consumption, network bandwidth, and/or any other device, application, or system consideration.

In one exemplary embodiment, device-specific training may include Quantization Aware Training (QAT). As a brief aside, logical model training relies on the iterative fine-tuning of a model's parameters with floating-point precision between nodes. Each training iteration could modify parameters by a large range of possible gradient updates. Unfortunately, embedded implementations may only have fixed-point precision; this means training must occur over the range of gradient updates that are representable by fixed-point precision. Consequently, in one specific implementation, “fake-quantization” is used during QAT to simulate the integer arithmetic in the forward pass while allowing for small floating-point gradient updates in the backward pass. Integer operations are simulated using floating-point values by rounding and truncating. It is important to note that the simulation of the underlying integer operations exactly matches the device-specific precision. In the backwards pass, small floating-point gradient updates are percolated back to the parameters. A high-precision floating-point copy of each of the parameters is updated with the gradient updates. In this way, the high precision parameters can accumulate multiple gradient updates before crossing integer boundaries and exhibiting these changes in the forward pass.
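The accumulation behavior of fake-quantization may be sketched for a single scalar parameter. The class name, the fixed gradient value, and the learning rate are hypothetical; a practical implementation would obtain gradients from a machine learning framework rather than supplying them by hand.

```python
def fake_quantize(w, b, q):
    """Simulate integer storage: round and clamp, then rescale to float."""
    scale = 2.0 ** q
    lo = -(2 ** (b - 1))
    hi = 2 ** (b - 1) - 1
    return scale * max(lo, min(hi, round(w / scale)))


class FakeQuantParam:
    """A parameter with a high-precision copy and a quantized forward view."""

    def __init__(self, value, b, q):
        self.shadow = float(value)   # high-precision floating-point copy
        self.b = b
        self.q = q

    def forward_value(self):
        # The forward pass only ever sees the device-representable value.
        return fake_quantize(self.shadow, self.b, self.q)

    def apply_gradient(self, grad, lr):
        # The backward pass updates the floating-point copy directly.
        self.shadow -= lr * grad


param = FakeQuantParam(0.0, b=8, q=-2)   # representable step of 0.25
history = []
for _ in range(3):
    param.apply_gradient(grad=-1.0, lr=0.05)   # three small updates of +0.05
    history.append(param.forward_value())
```

The first two updates leave the forward value unchanged; only the third moves the high-precision copy past the midpoint between representable values, at which point the forward value jumps by one step.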

More broadly, artisans of ordinary skill in the related arts will readily appreciate that hardware-aware training may be used at multiple points in the design process. In some embodiments, training may occur prior to design synthesis (step 1502). In some embodiments, training may be performed prior to software compilation/hardware partitioning (step 1514, described below). In still other embodiments, training may be an iterative process; training may be performed on preliminary synthesis passes, and results may be fed back to improve subsequent synthesis passes.

In one such implementation, the system may prune parameters and activations during device-specific training of the model (sub-step 1512). Mathematically, untrained neural networks have infinitely many potential ways of generating desired outputs from the inputs; training the neural network selects one solution. While logical neural networks could arbitrarily pick any solution, the exemplary device-aware training adds parameter-sparsity and activation-sparsity to provide maximum flexibility for optimal connectivity (and penalizes sub-optimal connections, discussed below).

As discussed in U.S. patent application Ser. No. ______, filed ______ and entitled “METHODS AND APPARATUS FOR MATRIX AND VECTOR STORAGE AND OPERATIONS”, previously incorporated herein by reference in its entirety, activation sparsity can be used to greatly reduce storage requirements as well as computational complexity. For instance, compression schemes may be used to represent sparse matrices with links to compressed column data structures, where each compressed column data structure only stores non-null entries to optimize column-based lookups of non-null entries. Similarly, sparse vector addressing skips nulled entries to optimize for vector-specific non-null multiply-accumulate operations. In one exemplary embodiment, activation-sparsity can be introduced by adding sparsifier layers to the logical model; the sparsifier layers are tagged as pruneable activations which can be preferentially sparsified and/or pruned during training to avoid undesirable penalties.
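The savings from skipping null entries may be illustrated with a generic compressed-column sketch. This is not the storage format of the incorporated application; the list-of-columns layout, the (index, value) vector encoding, and the function names are assumptions.

```python
def to_compressed_columns(dense):
    """Store only the non-null (row, value) pairs of each column."""
    rows, cols = len(dense), len(dense[0])
    columns = [
        [(i, dense[i][j]) for i in range(rows) if dense[i][j] != 0]
        for j in range(cols)
    ]
    return rows, columns


def sparse_matvec(rows, columns, sparse_vec):
    """Multiply using only non-null vector entries and non-null weights.

    sparse_vec: list of (index, value) pairs for the non-null activations.
    Returns the dense result and the number of multiply-accumulates used.
    """
    out = [0] * rows
    macs = 0
    for j, xj in sparse_vec:           # null activations are never visited
        for i, w in columns[j]:        # null weights are never stored
            out[i] += w * xj
            macs += 1
    return out, macs


dense = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 0, 0],
]
rows, columns = to_compressed_columns(dense)
result, macs = sparse_matvec(rows, columns, [(0, 2), (2, 1)])  # x = [2, 0, 1]
```

A dense implementation would perform nine multiply-accumulates for this 3×3 example; the sparse traversal performs four.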

Similarly, parameter sparsity may allow users to fit large models into hardware with a limited memory capacity. As illustrated above, parameter sparsity distributes neural network parameters to each of the cores of a multicore architecture; subsequent training can prune the parameters to incentivize local processing and penalize global communications. In this manner, each core is optimized for only a small slice of the overall neural network parameters. In other words, parameter and activation pruning tools may prompt models to achieve higher levels of parameter and activation sparsity when used during device-specific training.

In one specific implementation, the device-specific training algorithm is a multivariate optimization of latency (L), energy (E), and memory (M) based on the activation density (α) and parameter density (β), according to the following equations:

L=αβηL_(o)   EQN. 27:

E=αβηE_(o)   EQN. 28:

M=βM_(o)   EQN. 29:

As previously noted, density is the ratio of non-null elements to the total elements; density and sparsity are each non-negative and sum to one. L_(o), E_(o), and M_(o) are the baseline latency, energy, and memory for the logical model, measured when activations and parameters both have densities of one. These baselines define the theoretical maximum resources needed for a model. Additionally, η reflects the activation-parameter density affinity i.e., a characterization of the firing rates of neurons and connectivity. When the correlation between parameter and activation densities is zero, η=1. A positive correlation between parameter and activation densities results in η>1 (e.g., non-null activations occur in densely connected neurons); a negative correlation corresponds to η<1 (e.g., non-null activations in loosely connected neurons).
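EQNS. 27-29 may be evaluated directly, as in the following sketch. The function name and the baseline values are hypothetical and are chosen only to make the scaling visible.

```python
def resource_estimate(alpha, beta, eta, latency_0, energy_0, memory_0):
    """Scale baseline latency, energy, and memory per EQNS. 27-29."""
    return {
        "latency": alpha * beta * eta * latency_0,   # EQN. 27
        "energy": alpha * beta * eta * energy_0,     # EQN. 28
        "memory": beta * memory_0,                   # EQN. 29
    }


# Hypothetical baselines measured at activation and parameter density of one.
uncorrelated = resource_estimate(0.1, 0.1, 1.0, 100.0, 50.0, 8.0)
correlated = resource_estimate(0.1, 0.1, 1.5, 100.0, 50.0, 8.0)
```

Memory depends only on the parameter density, whereas latency and energy also scale with the activation density and the affinity η, so a positive affinity raises latency and energy without changing the memory footprint.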

In one exemplary device-specific training process, the neural network is trained to minimize α, β, and η. In one specific implementation, the device-specific training process uses a set of heuristics that penalize activations at a first (general) weight and penalize affinity-dense activations at a second (heavier) weight. Additionally, the device-specific training process may iteratively adjust parameter density; to minimize training complexity, parameter density may be gradually, but irreversibly pruned. Other techniques may allow for more training complexity, and support reversible parameter pruning.

In one implementation, the activation penalty may use a differentiable regularizer that gradually shrinks the activation density (α). In machine learning contexts, regularization is the process of adding information to solve an ill-posed problem or to prevent overfitting; differentiability ensures that the regularization occurs smoothly (continuously); e.g., gradient descent-based training is a first-order differentiable function.

In the exemplary implementation, the activation penalty is added to the overall objective function, which is minimized via gradient descent during training. The activation penalty Θ is an L1 norm of the model's N prune-able activations (a), given by the equation:

$\begin{matrix} {\Theta = \frac{1}{N}\left\| a \right\|_{1} = \frac{1}{N}\sum_{i = 1}^{N}\left| a_{i} \right|} & {{EQN}.\mspace{14mu} 30} \end{matrix}$

This rule penalizes neuron activations because each activation is associated with memory access and computational cost. The rule treats all neurons with an equal weighting, regardless of how many neurons they connect to.

Notably, certain neurons should be penalized more heavily than others because they can cause chains of downstream neuron activations; thus, an affinity-aware penalty strives to shrink αη instead of α alone. In one specific implementation, the affinity-aware activation penalty Θ_(η) is an L1 norm of the prune-able activations, weighted by each prune-able neuron's fanout (f_(i) of the ith prune-able neuron), given by the equation:

$\begin{matrix} {\Theta_{\eta} = \frac{\sum_{i = 1}^{N}{f_{i}\left| a_{i} \right|}}{\sum_{i = 1}^{N}f_{i}}} & {{EQN}.\mspace{14mu} 31} \end{matrix}$
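The two penalties may be compared on a small example. The sketch below assumes absolute values of the activations (consistent with the L1 norm described above); the function names, activation values, and fanouts are hypothetical.

```python
def activation_penalty(activations):
    """EQN. 30: mean absolute value of the prune-able activations."""
    return sum(abs(a) for a in activations) / len(activations)


def affinity_aware_penalty(activations, fanouts):
    """EQN. 31: fanout-weighted mean absolute activation."""
    weighted = sum(f * abs(a) for a, f in zip(activations, fanouts))
    return weighted / sum(fanouts)


a = [1.0, -2.0, 0.0, 3.0]

# Case 1: the silent neuron has the largest fanout.
theta = activation_penalty(a)
theta_eta_low = affinity_aware_penalty(a, [1, 1, 4, 2])

# Case 2: the most active neuron has the largest fanout.
theta_eta_high = affinity_aware_penalty(a, [1, 1, 2, 4])
```

The unweighted penalty is identical in both cases, while the affinity-aware penalty rises when the strongly firing neuron is also the one with the most downstream connections.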

In one embodiment, the device-specific training includes a structured, magnitude-based parameter pruning algorithm to gradually reduce the parameter density β during training. Unlike activation pruning, which can be indirectly weighted/trained with a regularization penalty, parameter pruning is performed directly and irreversibly to reflect the realities of hardware implementation (i.e., a core either has or does not have a parameter). At each pruning step, the pruning algorithm selects a number of parameter elements to prune. From this point on, these pruned parameter elements are set to null. Once an element is pruned, it is insensitive to gradient updates and will remain fixed at null for the duration of training.

As discussed in U.S. patent application Ser. No. ______, filed ______ and entitled “METHODS AND APPARATUS FOR MATRIX AND VECTOR STORAGE AND OPERATIONS”, previously incorporated herein by reference in its entirety, the exemplary device may group certain elements together to accelerate certain types of computation. In order to benefit from such hardware-acceleration, the device-specific training may incorporate element grouping in the training process. Specifically, the training process may selectively prune elements of a parameter matrix based on a structured magnitude-based criterion. In particular, pruning decisions are not made at a per-element basis, as this may lead to an unstructured sparsity pattern. Instead, the matrix may be broken down into subcomponents called pencils, and pruning decisions are made per-pencil instead of per-element. In an exemplary embodiment, a pencil is a column vector of 8 elements. For example, a matrix of shape (256, 256) would have 32 pencils per column, for a total of 8,192 pencils. The pencils with the lowest average magnitudes may be selected for pruning, until enough pencils have been pruned to reach the target sparsity level. The pencil structure is used to align with hardware memory interfaces—a read from memory extracts multiple consecutive elements.
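Pencil-based pruning may be sketched as follows. The sketch assumes an initially dense matrix stored as a list of rows, a row count that is a multiple of the pencil length, and a target sparsity expressed as the fraction of pencils to prune; the function name is hypothetical.

```python
PENCIL = 8  # elements per pencil (a column vector of 8 elements)


def prune_pencils(matrix, target_sparsity, pencil=PENCIL):
    """Null the pencils with the lowest average magnitude.

    Returns a pruned copy of the matrix and the total number of pencils.
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows % pencil != 0:
        raise ValueError("row count must be a multiple of the pencil length")

    # Score every pencil: 8 consecutive elements within one column.
    pencils = []
    for c in range(cols):
        for r0 in range(0, rows, pencil):
            mean_mag = sum(abs(matrix[r][c]) for r in range(r0, r0 + pencil)) / pencil
            pencils.append((mean_mag, r0, c))

    # Prune whole pencils, lowest average magnitude first.
    n_prune = int(target_sparsity * len(pencils))
    pencils.sort(key=lambda p: p[0])
    pruned = [row[:] for row in matrix]
    for _, r0, c in pencils[:n_prune]:
        for r in range(r0, r0 + pencil):
            pruned[r][c] = 0.0
    return pruned, len(pencils)


# 16 x 2 example: four pencils with average magnitudes 0.1, 5.0, 3.0, and 0.2.
small = [[0.1, 3.0] for _ in range(8)] + [[5.0, 0.2] for _ in range(8)]
small_pruned, small_count = prune_pencils(small, target_sparsity=0.5)

# The (256, 256) example: 32 pencils per column across 256 columns.
_, big_count = prune_pencils([[1.0] * 256 for _ in range(256)], target_sparsity=0.0)
```

Because entire pencils are nulled together, every surviving memory read of 8 consecutive elements returns only useful parameters.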

At step 1514, the synthesized device-specific primitives are mapped to the device architecture. More directly, the atomic operations that were logically synthesized in step 1502 are agnostic to the physical aspects of device implementation (e.g., timing and task scheduling, placement, race conditions, etc.). The mapping step assigns the device-specific primitives to physical resources of the device and generates the executable software. In one exemplary embodiment, mapping is performed in four stages: partitioning and placement (at sub-step 1516), optimization (at sub-step 1518), code generation (at sub-step 1520), and instruction-level optimization (at sub-step 1522). While the exemplary embodiment performs each stage iteratively, other embodiments may group stages for iteration (e.g., sub-steps 1516 and 1518 may be grouped and iterated over). Artisans of ordinary skill in the related arts, given the contents of the present disclosure, will readily appreciate that other implementations may further subdivide, merge, remove, add-to, and/or otherwise modify the mapping sub-stages described herein.

As an optional preliminary sub-step, hardware-agnostic atomic operations may be annotated by a human (either via a textual or graphical interface) for compilation and placement. Each node may be annotated to specify e.g., data type and/or data flow, operators, or markers for other hardware-specific functions such as core-to-core communication. Edges between nodes of the node graph represent data and/or control dependencies. Compiling the annotated representation to assembly code for placement on the hardware may attempt to optimize energy efficiency, packing efficiency, and computation latency, among other areas of optimization (e.g., performance).

Various embodiments of the present disclosure implement a variety of metrics to assess mapping quality. These metrics may be used to iteratively optimize between different mappings. For instance, an energy efficient mapping should execute the computation using as little energy as possible per inference/timestep. The primary contributor to energy consumption is the placement: in certain embodiments, core-to-core communication may be minimized. Similarly, an efficiently packed mapping maximizes core utilization for a given network. In one such implementation, packing efficiency refers to the amount of unused memory in each core; other efficient packings may minimize inter-core communication, etc. The packing efficiency of the mapping limits the network size that can practically fit on a given chip—there is an overhead on the “effective” amount of memory in the system. This indirectly affects energy efficiency since leakage current is a function of the core utilization.

Additionally, certain embodiments may assess the suitability of the mapping for a particular application. For instance, certain applications require computations to be performed as quickly as possible (or within other time constraints). Other applications may have performance and/or power limitations, etc. Still other applications may balance multiple considerations. For example, a solution with sub-optimal latency may increase the amount of time that the system stays in its highest-power state instead of sleeping in a low-power state.

In addition to spatial placement considerations, temporal utilization may also introduce a variety of considerations. As but one such example, a multicore architecture may use parallelism to accelerate processing. Each core may be a multithreaded processor associated with and working from private memory associated only with that processor and capable of running several SIMD instructions simultaneously. Unfortunately, thread-level parallelism may be limited by resource conflicts (e.g., instructions cannot run in parallel if their operands are from the same memory bank). One potential solution is to distribute data and operations across multiple resources to minimize resource conflicts; alternatively, or in addition, resource conflicts can be scheduled around.

Other examples of temporal restrictions include long latency core-to-core (inter-core) communications. This type of parallelism may arise from partitioning large nodes into smaller pieces. Inter-core communications may be minimized at the algorithm-level by communicating sparse vectors, and by keeping communication of dense vectors as local as possible.

Referring back to sub-step 1516 of FIG. 15, partitioning and placement generally refers to the process of determining the number of cores needed to support the neural network, and splitting the neural network program into core-specific sub-programs. In one specific implementation, the data and computation loads are balanced across cores, and the program is split with the objective of minimizing the total core-to-core communication. Other implementations may optimize for asymmetric placements; for example, heterogeneous multicore architectures may preferentially place certain types of functionality in certain cores. For example, a high-performance core may be coupled with a power efficient core, a highly-connected core, etc. Similarly, some neural networks may incorporate specialized logic (e.g., encryption, codecs, communication protocols, etc.). Various other implementations may be substituted with equal success, by artisans of ordinary skill in the related arts given the contents of the present disclosure.

FIG. 17 is a logical flow diagram of one exemplary implementation of the partitioning and placement sub-step 1516 of FIG. 15.

At step 1702, an initialization pass is performed on the device-specific primitives based on their atomic operators. In one exemplary embodiment, the synthesized node graph of device-specific primitives is input to the mapping algorithm. The edges between nodes of the node graph determine the data or control dependencies between nodes.

In one specific implementation, the mapping algorithm classifies atomic operator functionality into “OpNodes”, “DataNodes”, “TableNodes”, and “CommNodes.” OpNodes describe a mathematical operation, e.g., a matrix-vector multiplication. The mapping algorithm may either unfold OpNodes into device-specific primitives (instructions or logic) during code generation (see sub-step 1520 of FIG. 15) or prune the node out during compilation (see sub-step 1522 of FIG. 15). OpNodes may be annotated with e.g., location information (the core and thread that the OpNode has been assigned to), precision (e.g., standard or double-precision), and/or operation-specific constants (e.g., an immediate value for an immediate addition operation). DataNodes may store data structure information such as: data type, shape, and precision, location (core and bank), a constant value for fixed parameters, and/or sparsity information. TableNodes are used for objects that may be stored in table memory (e.g., look-up-table entries and column addresses for sparse matrix by sparse vector products). CommNodes are specialized logic that allow inter-core communication (see associated discussion at step 1714).
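
The four node classes may be pictured as simple records. The following is only a sketch: the class names follow the text, but every field name, type, and default value is an assumption chosen to mirror the annotations listed above.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class OpNode:
    op: str                              # e.g., "matvec"
    core: Optional[int] = None           # location: assigned core
    thread: Optional[int] = None         # location: assigned thread
    precision: str = "standard"          # or "double"
    constants: dict = field(default_factory=dict)  # operation-specific constants


@dataclass
class DataNode:
    dtype: str
    shape: Tuple[int, ...]
    precision: str = "standard"
    core: Optional[int] = None           # location: core
    bank: Optional[int] = None           # location: memory bank
    constant: Optional[object] = None    # fixed parameter value, if any
    sparse: bool = False                 # sparsity information


@dataclass
class TableNode:
    entries: list                        # e.g., look-up-table entries
    core: Optional[int] = None


@dataclass
class CommNode:
    direction: str                       # 'send' or 'recv'
    src_core: int
    dst_core: int


# Hypothetical immediate-addition operation carrying its constant.
add_imm = OpNode("add_immediate", constants={"imm": 3})
```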

At step 1704, the synthesized node graph is sequenced according to execution order. In some cases, this may require the addition of new edges to the synthesized node graph. For instance, control edges may be added to indicate control dependencies that do not have their own data dependency. Control edges may be used in some specific operations to ensure faithful execution order, so this pass may be repeated whenever the node graph changes.
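
One way to picture the sequencing pass is a topological sort over the union of data edges and control edges. The sketch below uses Python's standard `graphlib` module; the node names and the `sequence` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def sequence(nodes, data_edges, control_edges):
    """Return one execution order that respects data and control edges."""
    preds = {n: set() for n in nodes}
    for src, dst in list(data_edges) + list(control_edges):
        preds[dst].add(src)
    return list(TopologicalSorter(preds).static_order())


nodes = ["load_x", "matvec", "update_state", "emit_output"]
data_edges = [
    ("load_x", "matvec"),
    ("matvec", "update_state"),
    ("matvec", "emit_output"),
]
# Control edge with no data dependency: the output must be emitted
# before the state is overwritten.
control_edges = [("emit_output", "update_state")]
order = sequence(nodes, data_edges, control_edges)
```

Without the control edge, `update_state` and `emit_output` could be ordered either way; the added edge pins the order.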

At step 1706, certain matrix and vector operations may be optimized, minimized, or eliminated altogether. For example, matrix transpositions may be eliminated by propagating the operation back to a DataNode. These optimizations may be used to avoid computationally expensive operations (which may not be supported on all hardware types).

At step 1708, shift amounts may be standardized. For example, out-of-range shift amounts may be clipped or corrected. Standardized shift logic can reduce specialized logic for corner cases.
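
As a simple illustration of clipping, out-of-range shift amounts may be saturated to the largest shift the datapath supports. The 16-bit word width and the sign convention (positive for left shifts, negative for right shifts) are assumptions of this sketch.

```python
WORD_BITS = 16  # assumed datapath width


def standardize_shift(amount, word_bits=WORD_BITS):
    """Clip a signed shift amount to the range [-(word_bits-1), word_bits-1]."""
    limit = word_bits - 1
    return max(-limit, min(limit, amount))
```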

At step 1710, large OpNodes and their associated DataNodes may be split into smaller shards that can be placed on different cores (at step 1712), allowing for better memory balance and execution time for large operations.
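
For a matrix-vector product, one possible splitting is by rows: each shard holds a contiguous block of rows and produces the matching slice of the output. The sketch below assumes this row-wise scheme; the helper names are hypothetical and other splittings are equally possible.

```python
def matvec(matrix, vec):
    """Dense matrix-vector product on lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def split_rows(matrix, n_shards):
    """Split matrix rows into n_shards contiguous shards of near-equal size."""
    base, extra = divmod(len(matrix), n_shards)
    shards, start = [], 0
    for i in range(n_shards):
        size = base + (1 if i < extra else 0)
        shards.append(matrix[start:start + size])
        start += size
    return shards


W = [[r * 4 + c for c in range(4)] for r in range(6)]
x = [1, 0, 2, 0]
shards = split_rows(W, 4)
# Each shard could be evaluated on a different core; concatenating the
# partial results reproduces the unsplit product.
recombined = [v for shard in shards for v in matvec(shard, x)]
```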

At step 1712, each OpNode is assigned to a core. The amount of memory and computational work assigned to each core may be balanced. The partitions may be tiled in a way that minimizes the number of hops for each communication. As previously alluded to, different OpNode placements result in different performance; different mappings may be assessed according to the metrics described above (e.g., energy efficiency, packing efficiency, computational latency, etc.).
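
A candidate placement may be produced and scored as sketched below. The greedy largest-first heuristic, the row-major grid of cores, and the Manhattan hop metric are all assumptions made for illustration; the cost values are hypothetical.

```python
def assign_balanced(op_costs, n_cores):
    """Greedy balancing: place the largest remaining op on the least-loaded core."""
    load = [0] * n_cores
    placement = {}
    for op, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
        core = min(range(n_cores), key=lambda c: load[c])
        placement[op] = core
        load[core] += cost
    return placement, load


def total_hops(placement, edges, grid_width):
    """Sum of Manhattan distances between the cores of communicating ops."""
    def xy(core):
        return core % grid_width, core // grid_width

    hops = 0
    for src, dst in edges:
        (x0, y0), (x1, y1) = xy(placement[src]), xy(placement[dst])
        hops += abs(x0 - x1) + abs(y0 - y1)
    return hops


op_costs = {"a": 8, "b": 7, "c": 3, "d": 2}  # hypothetical memory/work costs
placement, load = assign_balanced(op_costs, n_cores=2)
hops = total_hops(placement, [("a", "b"), ("a", "d"), ("b", "c")], grid_width=2)
```

Different placements can then be compared by their load balance and hop totals, in the manner of the mapping metrics described above.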

At step 1714, communication nodes are inserted at core-to-core and core-to-chip input/output (IO) boundaries. Communication nodes may be placeholders which are later expanded into communication instructions. DataNodes falling on a chip boundary may be replicated on both sides (where each core gets its own copy of the data).
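
The insertion of placeholder communication nodes may be sketched as an edge rewrite: every edge whose endpoints sit on different cores is replaced by a send/receive pair. The dictionary representation and the naming scheme are assumptions of this sketch.

```python
def insert_comm_nodes(edges, core_of):
    """Replace each cross-core edge with src -> send -> recv -> dst."""
    new_edges, comm_nodes = [], []
    for src, dst in edges:
        if core_of[src] == core_of[dst]:
            new_edges.append((src, dst))  # same core: edge is kept as-is
            continue
        send = f"send[{src}->{dst}]"
        recv = f"recv[{src}->{dst}]"
        comm_nodes.append({"name": send, "kind": "send", "core": core_of[src]})
        comm_nodes.append({"name": recv, "kind": "recv", "core": core_of[dst]})
        new_edges += [(src, send), (send, recv), (recv, dst)]
    return new_edges, comm_nodes


core_of = {"op_a": 0, "op_b": 0, "op_c": 1}  # hypothetical placement
edges = [("op_a", "op_b"), ("op_b", "op_c")]
new_edges, comm_nodes = insert_comm_nodes(edges, core_of)
```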

At step 1716, DataNodes are assigned to the cores that were assigned with their neighboring OpNodes. In an alternative embodiment, this assignment may occur in a combined pass with step 1712.

While the foregoing discussion is presented in the context of a sequential order, it is appreciated that multiple iterations of partitioning and placement may be used in a trial-and-error manner to identify a suitable partitioning/placement. In each pass, the code/representation may be assessed multiple times, potentially with different assessment heuristics (e.g., energy efficiency, packing efficiency, and computation latency, etc.).

Returning back to sub-step 1518 of FIG. 15, each core's sub-graph is optimized to prepare for code generation. As large operations were previously (spatially) split across cores, large operations within one core may be further (temporally) split across multiple threads to enable faster operation. In one exemplary embodiment, the sub-graph nodes can be user annotated with scheduling and/or timing hints to smooth code generation. In one such embodiment, a designer may assign data (or DataNodes) to specific memory banks during code generation.

FIG. 18 is a logical flow diagram of one exemplary implementation of the placement optimization sub-step 1518 of FIG. 15.

At step 1802, neural network operations are split into multiple parallel operations, each meant to be executed by a single thread. In one exemplary embodiment, each thread may be allocated its own data path to reduce execution time, as discussed in U.S. patent application Ser. No. ______, filed ______ and entitled “METHODS AND APPARATUS FOR THREAD-BASED SCHEDULING IN MULTICORE NEURAL NETWORKS”, previously incorporated herein by reference in its entirety. As described therein, threads may run independently of one another, without any centralized scheduling and/or resource locking (e.g., semaphore signaling, critical path execution, etc.). Decoupling thread dependencies allows cores to execute threads asynchronously.

In some embodiments, DataNodes for parameter matrices may be split and/or copied so they can be accessed without contention. Similarly, parallelized OpNodes (and their corresponding DataNodes) cannot concurrently access the same resources. Ideally, conflicts can be avoided, but where a conflict must occur the parallelism will be limited (the instructions of the conflicting threads must directly, or indirectly, be serialized due to the resource conflict).

At step 1804, a sparsifying pass is performed. Dense data operations (e.g., dense matrix by dense vector products) may be converted to sparse data operations (e.g., sparse matrix by sparse vector products) if the sparsity and average sparsity values of the original DataNodes make it advantageous. In one embodiment, sparsification occurs even when only one of the matrix or the vector (but not both) is sparse; in such cases, the operand that was originally dense may be stored inefficiently in the sparse format, but the sparse matrix by sparse vector product may still offer an overall improvement.

For example, as described within U.S. patent application Ser. No. ______, filed ______ and entitled “METHODS AND APPARATUS FOR MATRIX AND VECTOR STORAGE AND OPERATIONS”, previously incorporated herein by reference in its entirety, matrices and vectors may be tagged as sparse or dense depending on their contents. The matrix-vector multiply math can either be performed with an instruction designed for dense data or an instruction designed for sparse data. In one specific implementation, the tagging directs which instructions are used to compute the math on the chip at run-time.
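
The benefit of converting a dense product into a sparse one can be illustrated by counting multiply-accumulate operations. The sketch below is not the storage format of the incorporated application; the (index, value) pair representation and the helper names are assumptions.

```python
def to_sparse(vec):
    """Keep only the non-zero elements as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(vec) if v != 0]


def dense_matvec(matrix, vec):
    """Dense product; every element pair costs one multiply-accumulate."""
    macs = len(matrix) * len(vec)
    out = [sum(a * b for a, b in zip(row, vec)) for row in matrix]
    return out, macs


def sparse_matvec(matrix, sparse_vec):
    """Sparse product; zero vector and zero matrix elements are skipped."""
    out, macs = [0] * len(matrix), 0
    for col, v in sparse_vec:
        for r, row in enumerate(matrix):
            if row[col] != 0:
                out[r] += row[col] * v
                macs += 1
    return out, macs


W = [[1, 0, 2, 0], [0, 0, 3, 4], [5, 0, 0, 0]]
x = [0, 0, 2, 0]
dense_out, dense_macs = dense_matvec(W, x)
sparse_out, sparse_macs = sparse_matvec(W, to_sparse(x))
```

Both paths return the same result; a sparsifying pass could compare such operation counts (together with storage overhead) when deciding whether the conversion is advantageous.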

Referring now to sub-step 1520 of FIG. 15, software code for each core's sub-graph is generated. In one specific implementation, each core's sub-graph is allocated to memory banks (using hints provided during optimization) and converted into machine-readable instructions. In one such implementation, communication code generation and thread control passes may also be added to the program(s).

FIG. 19 is a logical flow diagram of one exemplary implementation of the code generation sub-step 1520 of FIG. 15.

At step 1902, OpNodes and CommNodes are assigned to threads, based on grouping rules. For example, chains of operations in the node graph may be grouped together and sequentially executed. Parallel chains of operation may be executed concurrently. In some cases, a single chain may be split, or multiple chains may be sequenced e.g., to improve performance, reduce core utilization, etc. For instance, certain operations may be re-ordered to save on loads and stores by keeping values in the accumulator instead of writing out to memory.

In one embodiment, CommNodes mark core-to-core communication boundaries. Unidirectional inter-core communication may further specify whether a CommNode sends or receives data; e.g., each CommNode may include an attribute that encodes whether it is a ‘send’ or ‘recv’ node. CommNodes may be inserted into the node graph at inter-core boundaries after partitioning (see step 1516). During code generation, these nodes are used to construct the communication threads that contain SEND, RECV, and RDY instructions.

In some embodiments, CommNodes may additionally support other communication protocols and/or inter-chip communications. For example, an IONode may be used to communicate across a chip boundary (to another chip). While unidirectional communication (send/receive) is disclosed, bi-directional, multi-cast, and/or broadcast communication may be substituted with equal success by artisans of ordinary skill, given the contents of the present disclosure.

At step 1904, data is padded to a multiple of the pencil size. This pass may be used in some embodiments, particularly where there is no sub-word indexing in the instruction set architecture. For example, a length-5 vector may be padded into a length-8 vector, with the 3 final elements not being used in the computation. Unused elements consume memory storage, but reduce addressing complexity; thus, different pencil dimensions may be assigned in accordance with overall design considerations (e.g., energy efficiency, packing efficiency, computational latency, etc.).
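
The padding pass reduces to rounding each length up to the next multiple of the pencil size. In this sketch the helper name `pad_to_pencil` and the zero fill value are assumptions.

```python
def pad_to_pencil(vec, pencil=8, fill=0):
    """Pad a vector with `fill` so its length is a multiple of `pencil`."""
    remainder = len(vec) % pencil
    if remainder == 0:
        return list(vec)
    return list(vec) + [fill] * (pencil - remainder)


padded = pad_to_pencil([1, 2, 3, 4, 5])  # length-5 vector from the example above
```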

At step 1906, the sizes of DataNodes and TableNodes are computed in units of memory words. This may be used later when generating a bank assignment for all variables at step 1908.

At step 1908, DataNodes and TableNodes are assigned to banks of their respective memory types. In some embodiments, the assignment uses variable sizes. This pass tries to respect the thread concurrencies that were assigned to DataNodes previously (see step 1518 of FIG. 15), while also attempting to balance the memory assigned to each bank (maximizing packing, minimizing fragmentation).
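
Steps 1906 and 1908 may be pictured together as a size computation followed by a greedy bank assignment. The 64-bit word, the conflict table (nodes that are accessed concurrently and should not share a bank), and the largest-first heuristic are assumptions of this sketch.

```python
import math


def size_in_words(n_elements, bits_per_element, word_bits=64):
    """Size of a variable in memory words (word width is assumed)."""
    return math.ceil(n_elements * bits_per_element / word_bits)


def assign_banks(sizes, conflicts, n_banks):
    """Largest first: pick the least-filled bank holding no conflicting node."""
    fill = [0] * n_banks
    bank_of = {}
    for name in sorted(sizes, key=lambda n: -sizes[n]):
        blocked = {bank_of[o] for o in conflicts.get(name, ()) if o in bank_of}
        # Fall back to all banks if every bank is blocked.
        allowed = [b for b in range(n_banks) if b not in blocked] or list(range(n_banks))
        bank = min(allowed, key=lambda b: fill[b])
        bank_of[name] = bank
        fill[bank] += sizes[name]
    return bank_of, fill


sizes = {
    "W": size_in_words(256, 8),
    "x": size_in_words(16, 8),
    "h": size_in_words(20, 8),
}
conflicts = {"W": ["x"], "x": ["W"]}  # W and x are read by concurrent threads
bank_of, fill = assign_banks(sizes, conflicts, n_banks=2)
```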

At step 1910, assembly code is generated for the OpNodes in each thread. In code generation, an object-oriented representation of assembly language is generated for each of the threads that were optimized and assigned to the operation nodes (see step 1518). In addition to the “arithmetic” part of the code (declaration of data variables and emission of instructions that perform the various operations), “thread control” instructions may also be created to ensure the correct concurrent control flow.

After initializing the assembly code program object, the system begins by declaring the data variables associated with each core's DataNodes (using the computed bank assignments). Next the compiler generates a code snippet for each OpNode based on the node's constants as well as the DataNodes it is attached to. In some embodiments, the compiler may be agnostic to accumulator state; in such cases, the code snippets are emitted with the maximum set of loads and stores (e.g., making no assumptions about whether any of the variables needed by the OpNode are already in the accumulator). In another embodiment, the compiler may optimize the instruction order and keep track of accumulator residency operation-by-operation to optimize and reduce unnecessary loads and stores.
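
The difference between an accumulator-agnostic compiler and one that tracks accumulator residency may be sketched as follows. The mnemonics `LOAD`, `STORE`, `MUL`, `ADD`, and `MAX` are hypothetical placeholders rather than the device's instruction set, and the single-accumulator model is an assumption.

```python
def emit(ops, track_accumulator):
    """Emit code for ops given as (mnemonic, acc_input, operand, output).

    Without tracking, every op reloads its input (maximum loads and stores).
    With tracking, the load is skipped when the input is already resident.
    """
    code, in_acc = [], None
    for mnemonic, acc_input, operand, output in ops:
        if not (track_accumulator and in_acc == acc_input):
            code.append(f"LOAD {acc_input}")
        code.append(f"{mnemonic} {operand}")
        code.append(f"STORE {output}")
        in_acc = output  # the result stays in the accumulator
    return code


ops = [
    ("MUL", "x", "w", "t0"),
    ("ADD", "t0", "b", "t1"),
    ("MAX", "t1", "zero", "y"),
]
naive = emit(ops, track_accumulator=False)
tracked = emit(ops, track_accumulator=True)
```

A fuller pass could also drop stores of temporaries that are never read from memory; that would need liveness information, which this sketch omits.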

At step 1912, assembly code is generated for CommNodes. Each node's code is emitted into its own dedicated thread. Receive nodes may also add “special thread” sections to provide for subsequent inter-core communication flexibility.

At step 1914, inter-thread control passes are performed. In one specific implementation, the compiler inserts scoreboard (SB), sleep (SLEEP), and jump (JUMP) commands to implement thread-to-thread sequencing. Inter-thread communication is described in greater detail within U.S. patent application Ser. No. ______, filed ______ and entitled “METHODS AND APPARATUS FOR THREAD-BASED SCHEDULING IN MULTICORE NEURAL NETWORKS”, previously incorporated herein by reference in its entirety. In one such implementation, a “thread graph” is constructed with one node per thread. The thread graph includes an edge from one thread to another whenever an OpNode in the first thread has a data or control edge that terminates in the other thread. The number of inbound edges is each thread's initial score, and each thread's scoreboard may decrement each of its successors in the thread graph by one before it sleeps and increments its own score back to the initial score. Other schemes for inter-thread control may be substituted with equal success.
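
The thread graph and its scores may be sketched as below. The sketch assumes that a thread becomes runnable when its score reaches zero, and it simulates a single pass in software; the helper names and the example thread assignment are hypothetical.

```python
def build_thread_graph(op_edges, thread_of):
    """Return (successors, initial_score) with one graph node per thread."""
    threads = set(thread_of.values())
    succ = {t: set() for t in threads}
    for src, dst in op_edges:
        a, b = thread_of[src], thread_of[dst]
        if a != b:
            succ[a].add(b)  # edge crosses from thread a into thread b
    score = {t: 0 for t in threads}
    for a in succ:
        for b in succ[a]:
            score[b] += 1   # initial score = number of inbound edges
    return succ, score


def run_once(succ, initial):
    """Simulate one pass: run zero-score threads, decrement successors, reset."""
    score, order = dict(initial), []
    ready = sorted(t for t in score if score[t] == 0)
    while ready:
        t = ready.pop(0)
        order.append(t)
        for s in sorted(succ[t]):
            score[s] -= 1
            if score[s] == 0:
                ready.append(s)
        score[t] = initial[t]  # restore own score before sleeping
    return order, score


thread_of = {"a": 0, "b": 1, "c": 2, "d": 2}
op_edges = [("a", "c"), ("b", "d"), ("c", "d")]
succ, initial = build_thread_graph(op_edges, thread_of)
order, final = run_once(succ, initial)
```

After the pass every score is back at its initial value, so the same sequencing repeats on the next inference or timestep.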

Returning to FIG. 15, at step 1522 the resulting assembly code is optimized at the instruction-level. For example, the assembly code is modified to remove inefficiencies (e.g., unnecessary load and store instructions). The assembly code may be checked to ensure correct operation of the pass.

At step 1524, a behavioral simulator may be run which may allow for verification of the operation programs and estimation of the physical costs (energy, area, time) of the program. The behavioral simulator may include a module to model operation of a program and track operation counts and approximate hardware concurrency. The behavioral simulator loads the generated assembly code and a hardware configuration description, and runs test inputs through the simulation to extract metrics for the input pattern and given hardware configuration. Output may include estimated area, energy, and latency metrics.
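
The operation-count tracking may be illustrated by a toy cost model that tallies instructions and weights them by per-instruction costs from a configuration. Every number and mnemonic below is a placeholder rather than a measured value, and the model ignores concurrency entirely.

```python
from collections import Counter


def simulate(program, config):
    """Tally instructions and estimate energy and cycles from per-op costs."""
    counts = Counter(line.split()[0] for line in program)
    energy = sum(config["energy_pj"][op] * n for op, n in counts.items())
    cycles = sum(config["cycles"][op] * n for op, n in counts.items())
    return {"op_counts": dict(counts), "energy_pj": energy, "cycles": cycles}


program = ["LOAD x", "MUL w", "STORE y"]  # hypothetical assembly
config = {                                # hypothetical hardware description
    "energy_pj": {"LOAD": 2.0, "MUL": 1.0, "STORE": 2.0},
    "cycles": {"LOAD": 1, "MUL": 1, "STORE": 1},
}
report = simulate(program, config)
```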

At step 1526, machine code is generated by an assembler. In some embodiments, the generated code is a binary executable that can be run on the optimized hardware (such as architecture 200). In some embodiments, one or more listing files are created which may contain information about data memory, table memory, instruction memory, and the symbol table. This information may be helpful for debugging and analysis.

At step 1528, the generated machine code (and associated data) may be placed and run on the hardware e.g., a System on a Chip (SoC), an FPGA, or printed circuit board (PCB).

It will be appreciated that the various ones of the foregoing aspects of the present disclosure, or any parts or functions thereof, may be implemented using hardware, software, firmware, tangible, and non-transitory computer-readable or computer usable storage media having instructions stored thereon, or a combination thereof, and may be implemented in one or more computer systems.

It will be apparent to those skilled in the art that various modifications and variations can be made in the disclosed embodiments of the disclosed device and associated methods without departing from the spirit or scope of the disclosure. Thus, it is intended that the present disclosure covers the modifications and variations of the embodiments disclosed above provided that the modifications and variations come within the scope of any claims and their equivalents. 

What is claimed is:
 1. A neural network processing apparatus, comprising: a plurality of cores; and one or more memories configured to store a first set of global parameters and a second set of local parameters; wherein each core comprises logic configured to: obtain a first input vector; perform global neural network processing based on the first input vector and the first set of global parameters; perform local neural network processing based on the second set of local parameters and a previous activation state vector of the core; and generate a result vector for distribution to the plurality of cores.
 2. The neural network processing apparatus of claim 1, wherein the first set of global parameters comprises a portion of a global matrix associated with each core of the plurality of cores.
 3. The neural network processing apparatus of claim 1, further comprising logic configured to sparsify a dense vector to produce a resulting sparse vector.
 4. The neural network processing apparatus of claim 1, wherein: each core further comprises logic configured to: obtain one or more portions of a global activation vector from other cores of the plurality of cores; create the global activation vector based on the one or more portions of the global activation vector from the other cores of the plurality of cores; and perform the global neural network processing based on the global activation vector.
 5. The neural network processing apparatus of claim 1, wherein each core further comprises logic configured to broadcast the result vector to each other core of the plurality of cores.
 6. The neural network processing apparatus of claim 1, wherein each core further comprises logic configured to update the previous activation state vector of the core based on the result vector.
 7. The neural network processing apparatus of claim 1, wherein: the one or more memories comprises a first memory and a second memory; the first memory of the one or more memories is exclusively associated with a first core of the plurality of cores configured to store a first portion of the first set of global parameters and a second portion of the second set of local parameters associated with the first core; and the second memory of the one or more memories is exclusively associated with a second core of the plurality of cores different from the first core of the plurality of cores and is configured to store a third portion of the first set of global parameters different from the first portion and a fourth portion of the second set of local parameters different from the second portion associated with the second core.
 8. A method of operating a core of a multicore neural network architecture comprising: receiving a sparse activation vector; retrieving a dense activation vector, a portion of a global weight matrix, and a local matrix; calculating an updated dense activation vector based on the sparse activation vector, the dense activation vector, and the portion of the global weight matrix, and the local matrix; calculating a portion of an updated sparse activation vector based on the updated dense activation vector; and broadcasting the updated sparse activation vector to other cores of the multicore neural network architecture.
 9. The method of claim 8, wherein calculating the portion of the updated sparse activation vector includes applying a rectified linear activation function to the updated dense activation vector.
 10. The method of claim 8, wherein the dense activation vector, the portion of the global weight matrix, and the local matrix are retrieved from a memory associated with the core.
 11. The method of claim 10, wherein the memory is not associated with the other cores of the multicore neural network architecture.
 12. The method of claim 8, wherein calculating the portion of the updated sparse activation vector comprises: concatenating a sparse input vector and a sparse global activation vector to form a concatenated sparse vector; multiplying the portion of the global weight matrix with the concatenated sparse vector creating a first intermediate data structure; multiplying the local matrix with the dense activation vector creating a second intermediate data structure; performing a first sigmoid function on a first sum of a first section of the first intermediate data structure added to a second section of the second intermediate data structure creating a third intermediate data structure; performing a second sigmoid function on a second sum of a third section of the first intermediate data structure added to a fourth section of the second intermediate data structure creating a fourth intermediate data structure; performing a hyperbolic tangent function on a third sum of a fifth section of the first intermediate data structure and a first product of the third intermediate data structure and a sixth section of the second intermediate data structure creating a fifth intermediate data structure; and calculating the updated dense activation vector based on a fourth sum of a second product of the fourth intermediate data structure and the sparse global activation vector and a third product of the fifth intermediate data structure and a difference of one and the fourth intermediate data structure.
 13. The method of claim 8, further comprising receiving a sparse input vector and the sparse activation vector, from a broadcast communication to a plurality of cores of the multicore neural network architecture.
 14. The method of claim 8, wherein broadcasting the updated sparse activation vector to the other cores of the multicore neural network architecture occurs asynchronous to broadcasts from the other cores.
 15. A non-transitory computer readable apparatus comprising a storage medium having one or more computer programs stored thereon, the one or more computer programs, when executed by a processing apparatus, being configured to: obtain a first sparse vector; perform global neural network processing based on the first sparse vector and a first set of global parameters; perform local neural network processing based on a second set of local parameters and a dense vector that is specific to each core; and sparsify the dense vector to generate a second sparse vector for broadcast to a plurality of cores.
 16. The non-transitory computer readable apparatus of claim 15, wherein the first sparse vector comprises an input vector and the first set of global parameters comprises a portion of a global matrix associated with a first core.
 17. The non-transitory computer readable apparatus of claim 15, wherein sparsifying the dense vector comprises skipping elements or adding null elements to the dense vector.
 18. The non-transitory computer readable apparatus of claim 15, wherein sparsifying the dense vector comprises applying a rectified linear activation function to the dense vector.
 19. The non-transitory computer readable apparatus of claim 15, wherein each core further comprises logic configured to broadcast the second sparse vector to each other core of the plurality of cores.
 20. The non-transitory computer readable apparatus of claim 15, wherein the one or more computer programs are further configured to broadcast the second sparse vector to the plurality of cores. 